Model profile

MiMo-V2.5

Name: MiMo-V2.5
Author: Xiaomi

Xiaomi · Current · Released Apr 22, 2026

Data verified July 21, 2026

Overall Score

58.62 · Public #62 of 200

Arena Elo

1433

Eligible category ranks

3 of 8

Price (1M tokens)

Not listed (API pricing)

Speed

Not listed

Context

1M tokens

Evidence coverage

11 of 323 tracked benchmarks are published. 10 are verified and 1 provisional. 3 of 8 categories are measured.

Updated July 21, 2026 · Methodology

Published / tracked: 11 / 323
Verified: 10
Provisional: 1
Categories with evidence: 3 / 8
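
The coverage figures above can be turned into ratios with a few lines of arithmetic. This is a minimal sketch using only the counts shown on this page; the function and variable names are illustrative and not part of BenchLM.

```python
def coverage_ratio(covered: int, total: int) -> float:
    """Return covered / total as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return covered / total


# Counts copied from the evidence coverage summary on this page.
published, tracked = 11, 323
verified, provisional = 10, 1
categories_measured, categories_total = 3, 8

# Benchmarks per measured category: Agentic, Coding, Multimodal.
per_category = {"Agentic": 5, "Coding": 2, "Multimodal": 4}

# Sanity checks: the breakdowns should add up to the published total.
assert verified + provisional == published
assert sum(per_category.values()) == published

benchmark_coverage = coverage_ratio(published, tracked)
category_coverage = coverage_ratio(categories_measured, categories_total)

print(f"Benchmark coverage: {benchmark_coverage:.1%}")
print(f"Category coverage: {category_coverage:.1%}")
```

Roughly 3% of tracked benchmarks and under 40% of categories carry evidence, which is why most category rows on this profile read "Not measured".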

Category	Benchmarks	Evidence
Agentic	5	Verified
Coding	2	Verified
Reasoning	0	Not measured
Knowledge	0	Not measured
Math	0	Not measured
Multilingual	0	Not measured
Multimodal	4	Mixed evidence
Inst. Following	0	Not measured

Proprietary · Reasoning

Confidence: Medium

base

MiMo-V2.5 ranks #62 of 200 models on the public leaderboard with an overall score of 58.62/100. It does not yet have enough sourced coverage for BenchLM's verified leaderboard. It is not a frontier model: its three measured categories range from the 36th to the 77th percentile of their cohorts, so its fit depends on the use case.

MiMo-V2.5 is a proprietary model with a 1M token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

MiMo-V2.5 sits inside the MiMo-V2.5 family alongside MiMo-V2.5-Pro. BenchLM links it directly to MiMo-V2-Omni as the earlier related model in that lineage. This profile currently has 11 of 323 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

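Its highest-scoring category is Multimodal & Grounded (63.3, #19 of 29), and its lowest-scoring is Coding (50.6, #60 of 122). By cohort percentile the order differs: Agentic is its strongest position (77th percentile, #28 of 119) and Multimodal its weakest (36th percentile). The multimodal score suggests it is usable for screenshots, documents, charts, and grounded multimodal workflows, although 18 models in that cohort rank higher.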

Peer position

Exact provisional scores and ranks for the closest listed peers. A score can appear before a model clears the evidence threshold for a rank, so equal scores can have different rank states.

Range 58.15–59.0

Model	Author	Rank	Score
GPT-5.2 Instant	OpenAI	#59	59.0
GPT-5.3 Instant	OpenAI	#60	58.9
DeepSeek V4 Flash	DeepSeek	#61	58.88
MiMo-V2.5 (current model)	Xiaomi	#62	58.62
GPT-5 (high)	OpenAI	#63	58.61
GPT-5.2	OpenAI	#64	58.43
DeepSeek V3.2 (Thinking)	DeepSeek	#65	58.15

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Category	Percentile	Eligible cohort rank	Category score
Multimodal	36%	#19 of 29	63.3
Agentic	77%	#28 of 119	53.8
Coding	51%	#60 of 122	50.6
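
The page does not state how the percentile is derived from the cohort rank. As a sketch, the formula below (an assumption, not BenchLM's documented method) maps rank 1 to 100 and the last rank to 0, and it reproduces the three percentiles listed above when rounded to whole numbers. The function name is illustrative.

```python
def cohort_percentile(rank: int, cohort_size: int) -> float:
    """Map a 1-based rank to a 0-100 percentile: rank 1 -> 100, last rank -> 0.

    Assumed formula: 100 * (N - rank) / (N - 1). It matches the values on this
    page, but BenchLM's actual computation is not stated there.
    """
    if cohort_size < 2:
        raise ValueError("cohort_size must be at least 2")
    if not 1 <= rank <= cohort_size:
        raise ValueError("rank must be between 1 and cohort_size")
    return 100 * (cohort_size - rank) / (cohort_size - 1)


# (rank, cohort size) pairs copied from the category percentile section.
cohorts = {
    "Multimodal": (19, 29),
    "Agentic": (28, 119),
    "Coding": (60, 122),
}

for category, (rank, size) in cohorts.items():
    print(f"{category}: {cohort_percentile(rank, size):.0f}%")
```

A percentile computed this way depends on cohort size as much as on rank: #19 looks strong in isolation, but in a cohort of 29 it falls in the lower half.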

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	53.8	#28 of 119	77th	22%	5 benchmarks	Verified
Coding	50.6	#60 of 122	51st	20%	2 benchmarks	Verified
Reasoning	Not measured	Not ranked	Not available	17%	0 benchmarks	Not measured
Knowledge	Not measured	Not ranked	Not available	12%	0 benchmarks	Not measured
Math	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured
Multilingual	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
Multimodal	63.3	#19 of 29	36th	12%	4 benchmarks	Mixed sources
Inst. Following	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured
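
The weight column makes it possible to check how much of the overall weighting is backed by evidence for this model. The sketch below only sums the weights listed in the table; it does not attempt to reproduce the overall score, since the page does not say how unmeasured categories are handled. Variable names are illustrative.

```python
# Category weights (percent) copied from the table above.
weights = {
    "Agentic": 22,
    "Coding": 20,
    "Reasoning": 17,
    "Knowledge": 12,
    "Math": 5,
    "Multilingual": 7,
    "Multimodal": 12,
    "Inst. Following": 5,
}

# Categories that have a score on this profile.
measured = {"Agentic", "Coding", "Multimodal"}

total_weight = sum(weights.values())
measured_weight = sum(w for name, w in weights.items() if name in measured)
unmeasured_weight = total_weight - measured_weight

print(f"Total weight: {total_weight}%")
print(f"Weight with evidence: {measured_weight}%")
print(f"Weight without evidence: {unmeasured_weight}%")
```

Just over half of the category weight has sourced evidence behind it, which is consistent with the Medium confidence label on this profile.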

Chatbot Arena performance


Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1433	±4.5	42,281
Coding	1491	±6.9	12,219
Math	1440	±12.8	2,211
Instruction Following	1431	±6.5	14,034
Creative Writing	1392	±8.3	6,912
Multi-turn	1449	±8.0	7,452
Hard Prompts	1462	±5.2	28,157
Hard Prompts (English)	1471	±6.6	13,584
Longer Query	1452	±6.2	18,486
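
To read the Elo and confidence interval columns, the sketch below converts each row to interval bounds and shows what a rating gap implies under the conventional Elo formula. It assumes the ± value is a symmetric half-width around the Elo and that ratings sit on the usual 400-point logistic scale; neither is stated on this page. The 1400-rated opponent is hypothetical.

```python
def elo_interval(elo: float, half_width: float) -> tuple[float, float]:
    """Return (lower, upper) bounds, assuming a symmetric +/- half-width."""
    return (elo - half_width, elo + half_width)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# (view, Elo, +/- half-width) copied from the Chatbot Arena table.
views = [
    ("Text Overall", 1433, 4.5),
    ("Coding", 1491, 6.9),
    ("Math", 1440, 12.8),
    ("Creative Writing", 1392, 8.3),
]

for view, elo, half_width in views:
    low, high = elo_interval(elo, half_width)
    print(f"{view}: {low:.1f} to {high:.1f}")

# Hypothetical opponent rated 1400 in the Text Overall view.
win_share = expected_score(1433, 1400)
print(f"Expected score vs. a 1400-rated model: {win_share:.3f}")
```

The Math interval is the widest in the table, which lines up with its much smaller vote count (2,211 against 42,281 for Text Overall).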

Benchmark Details

Rows below have a displayable published verification record. Source-unverified manual rows and generated rows stay hidden.

Agentic · 5 benchmarks

Terminal-Bench 2.0 · Provider exact
65.8% · Weighted 38%
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

Claw-Eval · Benchmark exact
62.3% · Display only
Source: Claw-Eval leaderboard
Provenance: Claw-Eval reports this model as mimo_v25 in the official 2026-05-09 leaderboard snapshot. BenchLM stores the primary Pass^3 value on the local Claw-Eval display key.

MM-ClawBench · Provider exact
23.8% · Display only
Source: Xiaomi MiMo-V2.5
Provenance: Xiaomi reports Claw-Eval Multimodal at 23.8 on the MiMo-V2.5 launch page. BenchLM maps that to MM-ClawBench.

Gert Labs (Gert Labs Composite Game Benchmark) · Benchmark exact
46.89% · Display only
Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

ResearchClawBench · Benchmark exact
16.9% · Display only
Source: ResearchClawBench leaderboard
Provenance: ResearchClawBench reports this model as ResearchHarness (MiMo-V2.5) in the official Pass@1 leaderboard. BenchLM stores the one-decimal RADS average on the local ResearchClawBench display key and excludes it from weighted rankings.

Coding · 2 benchmarks

SWE-bench Pro · Provider exact
56.1% · Weighted 10%
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

Terminal-Bench 2.0 · Provider exact
65.8% · Display only
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

Multimodal · 4 benchmarks

MMMU-Pro (Massive Multi-discipline Multimodal Understanding Pro) · Provider exact
77.9% · Weighted 45%
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

CharXiv (CharXiv Reasoning) · Provider exact
81% · Weighted 25%
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

Video-MME (with subtitle) · Provider exact
87.7% · Display only
Source: Xiaomi MiMo-V2.5
Provenance: Provider exact

Design Arena Website (Design Arena Website Elo) · Reported
1291 · Display only
Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

MiMo-V2.5 Family

Base entry

Related earlier model: MiMo-V2-Omni

MiMo-V2.5-Pro · Score 70.19

Frequently Asked Questions

How does MiMo-V2.5 perform overall in AI benchmarks?

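MiMo-V2.5 has 11 published benchmark scores on BenchLM and a provisional overall score of 58.62, which places it #62 of 200 on the public leaderboard. It does not yet have enough non-generated coverage to receive a rank on BenchLM's verified leaderboard.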

Is MiMo-V2.5 good for coding and programming?

MiMo-V2.5 ranks #60 of 122 models in coding and programming benchmarks (51st percentile) with a category score of 50.6. That is a mid-cohort position: 59 models rank higher in this category.

Is MiMo-V2.5 good for agentic tool use and computer tasks?

MiMo-V2.5 ranks #28 of 119 models in agentic tool use and computer task benchmarks (77th percentile) with a category score of 53.8. This is its strongest relative position, though 27 models still rank higher in this category.

Is MiMo-V2.5 good for multimodal and grounded tasks?

MiMo-V2.5 ranks #19 of 29 models in multimodal and grounded task benchmarks (36th percentile) with a category score of 63.3. It is the model's highest category score, but 18 models rank higher in this smaller cohort.

Which sibling models are related to MiMo-V2.5?

MiMo-V2.5 belongs to the MiMo-V2.5 family. Related variants on BenchLM include MiMo-V2.5-Pro.

Does MiMo-V2.5 have full benchmark coverage on BenchLM?

Not yet. MiMo-V2.5 currently has 11 published benchmark scores out of the 323 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of MiMo-V2.5?

MiMo-V2.5 has a published context window of 1M tokens, which determines how much text it can process in a single interaction.


Last updated: July 21, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
