MiMo-V2.5-Pro-UltraSpeed
The UltraSpeed experience mode of MiMo-V2.5-Pro — a trillion-parameter (1T) flagship model reaching inference speeds of up to 1000 tokens/s, built for the most demanding real-time scenarios.
Capacity is limited and applications are approved daily, with priority given to professional organizations.
Model Specifications
Modality
Model Capabilities
Pricing
3x the price advantage of MiMo-V2.5-Pro, 10x the output experience. MiMo-V2.5-Pro-UltraSpeed limited-time trial price:
Recommended Scenarios
Quantitative Trading
When breaking news drops, the model analyzes market impact and generates trading signals within milliseconds — closing the decision loop before the market moves for truly low-latency quantitative response.
Real-time Risk Control
Complete complex fraud reasoning and risk assessment within hundreds of milliseconds before settlement. Break past the limits of traditional rule engines, balancing real-time speed with deep reasoning.
Scientific Research
Power instant generation and validation of large-scale hypotheses, cutting human-machine latency to near real-time. Eliminate waiting gaps and keep researchers' thinking continuous.
Real-time Coding Assistance
Deliver code generation that outpaces reading speed for zero-perceived-latency completion. Complex refactors finish in an instant, meaningfully improving development continuity.
Inference Experience
MiMo-V2.5-Pro-UltraSpeed vs MiMo-V2.5-Pro Inference Speed Comparison
Instant 3D Racing Game Built with Three.js
Integration
import os

from openai import OpenAI

# Read the API key from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ.get("MIMO_API_KEY"),
    base_url="https://api.xiaomimimo.com/v1",
)

# Request a streamed completion so tokens can be printed as they arrive.
completion = client.chat.completions.create(
    model="mimo-v2.5-pro-ultraspeed",
    messages=[
        {
            "role": "system",
            "content": "You are MiMo, an AI assistant developed by Xiaomi. Today is date: Tuesday, December 16, 2025. Your knowledge cutoff date is December 2024.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a modern-style SaaS landing page as a single file containing only HTML, CSS and JavaScript.",
                }
            ],
        },
    ],
    max_completion_tokens=131072,
    stream=True,
)

print("\n========== [Thinking Content] ==========\n")
answering = False
for chunk in completion:
    # Skip chunks that carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Thinking tokens arrive in reasoning_content when the chunk carries them.
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        print(delta.reasoning_content, end="", flush=True)
    # Answer tokens arrive in content; print the answer header once.
    if hasattr(delta, "content") and delta.content:
        if not answering:
            print("\n\n========== [Answer Content] ==========\n")
            answering = True
        print(delta.content, end="", flush=True)

1T × 1000 tokens/s
MiMo Algorithm Innovation
FP4 Mixed-Precision Quantization
FP4 quantization is applied only to the MoE experts, while everything else keeps its original precision. FP4 quantization-aware training (QAT) drastically shrinks model size and makes better use of hardware memory bandwidth while preserving model capability at a near-lossless level.
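To make the idea of FP4 quantization concrete, here is a minimal sketch of the "fake quantization" step that QAT schemes typically apply in the forward pass: weights are rounded to the nearest 4-bit value and immediately dequantized. The E2M1 value grid, the per-block absmax scaling, and the function name `fake_quantize_fp4` are assumptions chosen for illustration; the page does not specify which FP4 format or block layout MiMo uses.

```python
# Illustrative sketch only. The E2M1 grid and per-block absmax scaling are
# assumptions, not details confirmed by the source.
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes; sign is a separate bit


def fake_quantize_fp4(block):
    """Round a block of weights to the nearest FP4 value, then dequantize.

    Returns the dequantized weights and the shared block scale.
    """
    absmax = max(abs(w) for w in block)
    if absmax == 0.0:
        return list(block), 1.0
    scale = absmax / FP4_E2M1_GRID[-1]  # map the largest weight onto 6.0
    dequantized = []
    for w in block:
        # Pick the grid magnitude closest to the scaled weight.
        magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        dequantized.append(magnitude * scale if w >= 0 else -magnitude * scale)
    return dequantized, scale


weights = [0.12, -0.30, 0.05, 0.60, -0.45, 0.00, 0.22, -0.08]
dequantized, scale = fake_quantize_fp4(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"scale={scale:.3f} max_error={max_error:.3f}")
```

Each weight is stored in 4 bits plus one shared scale per block, which is where the size reduction comes from. The rounding error is bounded by half the widest grid gap times the scale.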
DFlash Speculative Decoding
Replaces traditional autoregressive drafting with block-level masked parallel prediction. The draft model uses sliding-window attention (SWA) to reduce prediction compute to a constant level, paired with the Muon optimizer and self-distillation for high acceptance rates, which translate directly into substantial inference throughput gains.
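The link between acceptance rate and throughput can be seen in a toy version of generic speculative decoding with greedy verification. This is not DFlash itself: the two "models" below are hypothetical stand-in functions, and `verify_block`, `draft_block`, and `generate` are names invented for this sketch. The draft proposes a whole block in one call, and the target keeps the longest prefix it agrees with.

```python
def verify_block(context, block, target_next_token):
    """Greedy verification: keep the longest draft prefix the target agrees with.

    Returns the accepted draft tokens followed by one token from the target
    (a correction at the first mismatch, or a bonus token if all matched).
    A real system scores every position in one target forward pass; the
    sequential calls here are only for readability.
    """
    accepted = []
    for token in block:
        expected = target_next_token(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(token)
    accepted.append(target_next_token(context + accepted))  # bonus token
    return accepted


def target_next_token(tokens):
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_block(tokens, block_size=4):
    """Hypothetical draft model: predicts a whole block at once, but is
    wrong whenever the true token is 7."""
    guesses = [(tokens[-1] + i + 1) % 10 for i in range(block_size)]
    return [0 if g == 7 else g for g in guesses]


def generate(prompt, num_tokens):
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < num_tokens:
        tokens += verify_block(tokens, draft_block(tokens), target_next_token)
        rounds += 1
    return tokens[: len(prompt) + num_tokens], rounds


output, rounds = generate([0], 12)
print(f"{len(output) - 1} tokens in {rounds} verification rounds")
```

The output is identical to what the target would produce on its own, but it takes fewer target rounds than tokens generated. The more draft tokens are accepted per round, the larger that saving becomes.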
TileRT System-level Optimization
Persistent Kernel Engine
Abandons per-operator launches; the compute pipeline stays resident on the GPU and runs continuously, with full-path prefetching to maximize overlap between data movement and computation.
Heterogeneous Pipeline Collaboration
Tile-level decomposition assigns communication, data movement, and tensor computation to different warps, evolving the GPU into a continuously flowing, precisely coordinated heterogeneous execution system.
MiMo-V2.5-Pro-UltraSpeed breaks through 1000 tokens/s for the first time without sacrificing intelligence or requiring custom silicon. Xiaomi has shattered the industry's "impossible triangle", the belief that speed, capability, and general-purpose GPUs cannot coexist. This is the result of cutting-edge algorithms and system infrastructure deeply converging and co-evolving.
Try MiMo-V2.5-Pro-UltraSpeed Now
No code needed: experience ultra-fast inference at 1000 tokens/s right in your browser