MiMo-V2-Flash Release Note 2026/02/04

- Upgraded Coding Capabilities in Thinking Mode: Optimized specifically for programming scenarios, Thinking Mode now scores 78.6 on SWE-Bench Verified, up from 74.2 for mimo-v2-flash-0112. Both the resolution rate and the quality of generated code have improved.
- Substantial Boost in Tool Calling Accuracy: Stability issues with tool usage have been resolved. The tool call success rate in Thinking Mode has risen from 64% to 97.0%, making execution in Agent scenarios considerably more reliable.
- Enhanced Instruction Following & Reduced Hallucinations:
  - Instruction Following: Improved adherence to specific instructions, with an AA-IFBench score of 72 (64 for mimo-v2-flash).
  - Factuality: More rigorous factual responses, with the AA-Omniscience Non-Hallucination Rate rising to 52% (9% for mimo-v2-flash).
- Optimized Handling of Complex Tasks: Performance on Arena-Hard (Hard Prompt) in Thinking Mode has improved, with the score rising from 58.3 to 60.6. The model now handles high-difficulty logic problems better.
- More Efficient Chain-of-Thought (CoT): Optimized CoT generation strategies significantly reduce the consumption of redundant tokens. On benchmarks such as AIME25 and HMMT, the average generation length has decreased by 13% to 30%. This lowers latency and token costs while maintaining model performance.

| Benchmark | mimo-v2-flash-0204 | mimo-v2-flash-0112 | mimo-v2-flash |
|---|---|---|---|
| SWE-Bench Verified Non-Thinking | 73.7 | 73.3 | 73.4 |
| SWE-Bench Verified Thinking | 78.6 | 74.2 | - |
| Arena-Hard (Hard Prompt) Non-Thinking | 49.3 | 52.7 | 46.0 |
| Arena-Hard (Creative Writing) Non-Thinking | 85.0 | 86.0 | 78.3 |
| Arena-Hard (Hard Prompt) Thinking | 60.6 | 58.3 | 54.1 |
| Arena-Hard (Creative Writing) Thinking | 85.8 | 90.4 | 86.2 |
| AA-IFBench | 72 | - | 64 |
| AA-Omniscience Accuracy | 19 | - | 27 |
| AA-Omniscience Non-Hallucination Rate | 52% | - | 9% |
| Tool call success rate Thinking | 97.0% | 64% | 44% |

| Benchmark | mimo-v2-flash (Acc) | mimo-v2-flash (Avg Tokens) | mimo-v2-flash-0204 (Acc) | mimo-v2-flash-0204 (Avg Tokens) | Length Reduction Ratio |
|---|---|---|---|---|---|
| AIME25 | 94.8 | 26984 | 91.1 | 18879 | 30.04% |
| HMMT_Feb_25 | 94.2 | 29294 | 92.9 | 21470 | 26.71% |
| LiveCodeBench-AA | 83.2 | 21488 | 84.9 | 18335 | 14.67% |
| GPQA-Diamond | 83.7 | 15862 | 83.8 | 13659 | 13.89% |
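
The length reduction column above can be reproduced from the two average-token columns. The sketch below assumes the ratio is the relative decrease in average tokens from mimo-v2-flash to mimo-v2-flash-0204. The release note does not state this formula; it is inferred because it matches the published figures. The names `AVG_TOKENS` and `length_reduction_pct` are illustrative only.

```python
# Average generation length in tokens, copied from the table above:
# benchmark -> (mimo-v2-flash, mimo-v2-flash-0204)
AVG_TOKENS = {
    "AIME25": (26984, 18879),
    "HMMT_Feb_25": (29294, 21470),
    "LiveCodeBench-AA": (21488, 18335),
    "GPQA-Diamond": (15862, 13659),
}


def length_reduction_pct(old_tokens: int, new_tokens: int) -> float:
    """Relative decrease in average generation length, in percent.

    Assumed formula (not stated in the release note):
    (old - new) / old * 100
    """
    return (old_tokens - new_tokens) / old_tokens * 100


# Round to two decimals, the precision used in the table.
reductions = {
    name: round(length_reduction_pct(old, new), 2)
    for name, (old, new) in AVG_TOKENS.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption the computed values agree with the table to two decimal places, from 13.89% on GPQA-Diamond to 30.04% on AIME25.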

Note: The API call method and model name remain unchanged.