MiMo-V2-Flash Release Note 2026/02/04

Upgraded Coding Capabilities in Thinking Mode: Thinking Mode has been specifically optimized for programming scenarios and now scores 78.6 on SWE-Bench Verified, up from 74.2 for mimo-v2-flash-0112. Both the resolution rate and the quality of code generation have improved.
Substantial Boost in Tool Calling Accuracy: Stability issues regarding tool usage have been resolved. Tool calling accuracy in Thinking Mode has surged from 64% to 97.0%, greatly enhancing execution reliability in Agent scenarios.
Enhanced Instruction Following & Reduced Hallucinations:

Instruction Following: Improved adherence to specific instructions, achieving an AA-IFBench score of 72.
Factuality: Enhanced rigor in factual responses, with the AA-Omniscience Non-Hallucination Rate rising from 9% (mimo-v2-flash) to 52%.

Optimized Handling of Complex Tasks: Performance on Arena-Hard (Hard Prompt) in Thinking Mode has been strengthened, with the score rising from 58.3 (mimo-v2-flash-0112) to 60.6. The model now performs better on high-difficulty logic problems.
More Efficient Chain-of-Thought (CoT): By optimizing CoT generation strategies, the consumption of redundant tokens has been significantly reduced. Across AIME25, HMMT_Feb_25, LiveCodeBench-AA, and GPQA-Diamond, the average generation length has decreased by roughly 14% to 30% relative to mimo-v2-flash. This lowers latency and token costs while largely maintaining model performance.

Benchmark	mimo-v2-flash-0204	mimo-v2-flash-0112	mimo-v2-flash
SWE-Bench Verified Non-Thinking	73.7	73.3	73.4
SWE-Bench Verified Thinking	78.6	74.2	-
Arena-Hard(Hard Prompt) Non-Thinking	49.3	52.7	46.0
Arena-Hard(Creative Writing) Non-Thinking	85.0	86.0	78.3
Arena-Hard(Hard Prompt) Thinking	60.6	58.3	54.1
Arena-Hard(Creative Writing) Thinking	85.8	90.4	86.2
AA-IFBench	72	-	64
AA-Omniscience Accuracy	19	-	27
AA-Omniscience Non-Hallucination Rate	52%	-	9%
Tool call success rate Thinking	97.0%	64%	44%

Benchmark	mimo-v2-flash (Acc)	mimo-v2-flash (Avg Tokens)	mimo-v2-flash-0204 (Acc)	mimo-v2-flash-0204 (Avg Tokens)	Length Reduction Ratio (%)
AIME25	94.8	26984	91.1	18879	30.04%
HMMT_Feb_25	94.2	29294	92.9	21470	26.71%
LiveCodeBench-AA	83.2	21488	84.9	18335	14.67%
GPQA-Diamond	83.7	15862	83.8	13659	13.89%
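
The Length Reduction Ratio column appears to be the relative drop in average tokens from mimo-v2-flash to mimo-v2-flash-0204. The release note does not state the formula, so the sketch below assumes (old - new) / old; the function and variable names are illustrative only.

```python
# Average generation length (tokens) per benchmark, copied from the table above.
# Each tuple is (mimo-v2-flash, mimo-v2-flash-0204).
AVG_TOKENS = {
    "AIME25": (26984, 18879),
    "HMMT_Feb_25": (29294, 21470),
    "LiveCodeBench-AA": (21488, 18335),
    "GPQA-Diamond": (15862, 13659),
}


def length_reduction_pct(old_tokens: int, new_tokens: int) -> float:
    """Relative drop in average generation length, in percent.

    Assumed formula (not stated in the release note): (old - new) / old * 100.
    """
    return (old_tokens - new_tokens) / old_tokens * 100


for name, (old, new) in AVG_TOKENS.items():
    print(f"{name}: {length_reduction_pct(old, new):.2f}%")
```

Under this assumption the computed values match the table to two decimal places (30.04%, 26.71%, 14.67%, 13.89%).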

Note: The API call method and the model name remain unchanged.

Update Time May 28, 2026