MiMo-V2.5-Pro-UltraSpeed

The UltraSpeed experience mode of MiMo-V2.5-Pro — a trillion-parameter (1T) flagship model reaching inference speeds of up to 1000 tokens/s, built for the most demanding real-time scenarios.

Capacity is limited. Applications are reviewed daily, with priority given to professional organizations.

Model Specifications

Modality

Input: Text
Output: Text

Model Capabilities

Ultra-fast Inference
Deep Thinking
Tool Calling
Streaming Output

Pricing

At 3x the price of MiMo-V2.5-Pro, MiMo-V2.5-Pro-UltraSpeed delivers roughly 10x the output speed. Limited-time trial pricing:

Prices are per million tokens (CNY / USD); Output TPS is output tokens per second.

Item                 MiMo-V2.5-Pro-UltraSpeed    MiMo-V2.5-Pro
Input (Cache Hit)    ¥0.075 / $0.0108            ¥0.025 / $0.0036
Input (Cache Miss)   ¥9 / $1.305                 ¥3 / $0.435
Output               ¥18 / $2.61                 ¥6 / $0.87
Output TPS           ~500-1000                   ~50-100
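
To make the pricing concrete, the sketch below estimates the cost of a single request from the CNY prices listed above. The helper name `estimate_cost_cny` and the example token counts are illustrative assumptions, not part of any official SDK.

```python
# Prices in CNY per million tokens, copied from the pricing table above.
PRICES_CNY = {
    "MiMo-V2.5-Pro-UltraSpeed": {"cache_hit": 0.075, "cache_miss": 9.0, "output": 18.0},
    "MiMo-V2.5-Pro": {"cache_hit": 0.025, "cache_miss": 3.0, "output": 6.0},
}


def estimate_cost_cny(model, cached_input_tokens, uncached_input_tokens, output_tokens):
    """Estimate the cost of one request in CNY (hypothetical helper)."""
    price = PRICES_CNY[model]
    total = (
        cached_input_tokens * price["cache_hit"]
        + uncached_input_tokens * price["cache_miss"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000  # prices are quoted per million tokens


if __name__ == "__main__":
    # Example workload (assumed): 20,000 uncached input tokens, 4,000 output tokens.
    for model in PRICES_CNY:
        cost = estimate_cost_cny(model, 0, 20_000, 4_000)
        print(f"{model}: {cost:.4f} CNY")
```

For this example workload the UltraSpeed request costs three times as much as the standard one, which matches the 3x price ratio in the table.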

Quantitative Trading

When breaking news drops, the model analyzes market impact and generates trading signals within milliseconds, closing the decision loop before the market moves and enabling truly low-latency quantitative response.

Real-time Risk Control

Complete complex fraud reasoning and risk assessment within hundreds of milliseconds before settlement. Break past the limits of traditional rule engines, balancing real-time speed with deep reasoning.
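
A rough way to size such a latency budget is to convert a deadline into the number of output tokens that fit at the quoted Output TPS range. The sketch below is a back-of-the-envelope estimate only: the `overhead_ms` parameter (network round trip, time to first token) is an assumption the source does not quantify, and the helper name is hypothetical.

```python
# Output TPS ranges (tokens per second) as quoted in the pricing section.
TPS_RANGE = {
    "MiMo-V2.5-Pro-UltraSpeed": (500, 1000),
    "MiMo-V2.5-Pro": (50, 100),
}


def output_token_budget(model, deadline_ms, overhead_ms=0.0):
    """Return (low, high) output-token counts that fit within the deadline.

    overhead_ms is an assumed allowance for everything other than decoding.
    """
    decode_ms = max(deadline_ms - overhead_ms, 0.0)
    low_tps, high_tps = TPS_RANGE[model]
    return int(decode_ms * low_tps / 1000), int(decode_ms * high_tps / 1000)


if __name__ == "__main__":
    # Assumed scenario: a 300 ms deadline with 100 ms reserved for overhead.
    for model in TPS_RANGE:
        print(model, output_token_budget(model, deadline_ms=300, overhead_ms=100))
```

Under these assumptions, 200 ms of decoding leaves room for roughly 100 to 200 output tokens on UltraSpeed, versus 10 to 20 on the standard model.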

Scientific Research

Power instant generation and validation of large-scale hypotheses, cutting human-machine latency to near real-time. Eliminate waiting gaps and keep researchers' thinking continuous.

Real-time Coding Assistance

Deliver code generation that outpaces reading speed for zero-perceived-latency completion. Complex refactors finish in an instant, meaningfully improving development continuity.

Inference Experience

MiMo-V2.5-Pro-UltraSpeed vs MiMo-V2.5-Pro Inference Speed Comparison

Instant 3D Racing Game Built with Three.js

Integration

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("MIMO_API_KEY"),
    base_url="https://api.xiaomimimo.com/v1"
)

completion = client.chat.completions.create(
    model="mimo-v2.5-pro-ultraspeed",
    messages=[
        {
            "role": "system",
            "content": "You are MiMo, an AI assistant developed by Xiaomi. Today is date: Tuesday, December 16, 2025. Your knowledge cutoff date is December 2024."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a modern-style SaaS landing page as a single file containing only HTML, CSS and JavaScript."
                }
            ]
        }
    ],
    max_completion_tokens=131072,
    stream=True
)

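# Stream the response and print thinking content and answer content under separate headers.
print("\n========== [Thinking Content] ==========\n")
answering = False
for chunk in completion:
    # Skip chunks that carry no choices.
    if not chunk.choices:
        continue

    delta = chunk.choices[0].delta
    # reasoning_content may be missing on some chunks, so guard the access.
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        print(delta.reasoning_content, end='', flush=True)

    if hasattr(delta, "content") and delta.content:
        # Print the answer header once, when the first answer text arrives.
        if not answering:
            print("\n\n========== [Answer Content] ==========\n")
            answering = True
        print(delta.content, end='', flush=True)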

1T × 1000 tokens/s

MiMo Algorithm Innovation

FP4 Mixed-Precision Quantization

FP4 quantization is applied only to the MoE experts, while everything else keeps its original precision. FP4 quantization-aware training (QAT) drastically shrinks model size and makes the most of hardware bandwidth while keeping model capability near-lossless.
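
The idea can be illustrated with a small quantize-then-dequantize ("fake quantization") sketch of the kind QAT setups commonly place in the forward pass. Everything specific here is an assumption for illustration: the source does not state which FP4 format, block size, or scaling scheme is used, so the code assumes the common E2M1 value grid, a shared per-block scale, and hypothetical parameter names containing `.experts.`.

```python
# Magnitudes representable in the E2M1 FP4 format (assumed format; the source
# does not say which FP4 variant is used).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_block(values):
    """Quantize one block to FP4 with a shared scale, then dequantize."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values)
    scale = max_abs / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def fake_quantize(weights, block_size=32):
    """Apply block-wise FP4 fake quantization to a flat list of weights."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[i:i + block_size]))
    return out


def quantize_experts_only(params, block_size=32):
    """Quantize only parameters whose (hypothetical) name marks them as MoE experts."""
    return {
        name: fake_quantize(w, block_size) if ".experts." in name else list(w)
        for name, w in params.items()
    }


if __name__ == "__main__":
    params = {
        "layer0.experts.0.w": [6.0, -3.0, 1.4, 0.1],
        "layer0.attn.q": [6.0, -3.0, 1.4, 0.1],
    }
    for name, w in quantize_experts_only(params).items():
        print(name, w)
```

In the demo, the expert weights snap to the FP4 grid while the attention weights pass through unchanged, mirroring the mixed-precision split described above.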

DFlash Speculative Decoding

Replaces traditional autoregressive drafting with block-level masked parallel prediction. The draft model uses sliding-window attention (SWA) to reduce prediction compute to a constant level, and is paired with the Muon optimizer and self-distillation to reach high acceptance rates, which translate directly into substantial inference throughput gains.
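
The reason acceptance rate matters can be seen in a generic draft-then-verify loop. The sketch below is not DFlash itself: it omits masked parallel prediction, SWA, and the training recipe, and uses toy stand-in functions for both models. It only shows greedy block verification, where the target model keeps the longest draft prefix it agrees with and then contributes one token of its own.

```python
def verify_block(context, draft_block, target_next_token):
    """Accept the longest draft prefix the target agrees with (greedy verification).

    A real system scores every draft position in a single target forward pass;
    the target is called position by position here only for clarity.
    """
    accepted = []
    for proposed in draft_block:
        expected = target_next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(proposed)
    # Whole block accepted: the target still yields one extra token.
    accepted.append(target_next_token(context + accepted))
    return accepted


def speculative_generate(prompt, draft_block_fn, target_next_token,
                         max_new_tokens, block_size=4):
    """Generate tokens block by block; returns (tokens, number of verify passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        block = draft_block_fn(tokens, block_size)
        tokens.extend(verify_block(tokens, block, target_next_token))
        passes += 1
    return tokens[:len(prompt) + max_new_tokens], passes


# Toy stand-ins (assumptions for the demo): the target counts upward modulo 10,
# and the draft proposes a whole block at once but always gets the value 7 wrong.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10


def toy_draft(tokens, block_size):
    block = [(tokens[-1] + i) % 10 for i in range(1, block_size + 1)]
    return [0 if t == 7 else t for t in block]


if __name__ == "__main__":
    out, passes = speculative_generate([0], toy_draft, toy_target, max_new_tokens=8)
    print(out, passes)
```

In this toy run, eight new tokens are produced in three verification passes instead of eight sequential target steps, and the output is identical to what the target alone would generate. Each rejected draft token ends a block early, so higher acceptance rates mean fewer target passes per generated token.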

TileRT System-level Optimization

Persistent Kernel Engine

Abandons per-operator launches; the compute pipeline stays resident on the GPU and runs continuously, with full-path prefetching to maximize overlap between data movement and computation.

Heterogeneous Pipeline Collaboration

Tile-level decomposition assigns communication, data movement, and tensor computation to different warps, evolving the GPU into a continuously flowing, precisely coordinated heterogeneous execution system.

For the first time, inference breaks through 1000 tokens/s without sacrificing intelligence or requiring custom silicon. Xiaomi has broken the industry's "impossible triangle", the assumption that speed, capability, and general-purpose GPUs cannot coexist. This is the result of cutting-edge algorithms and system infrastructure converging deeply and co-evolving.

Try MiMo-V2.5-Pro-UltraSpeed Now

No code needed: experience 1000 TPS ultra-fast inference right in your browser.
