Model profile

Kimi K2.5

Name: Kimi K2.5
Author: Moonshot AI

Moonshot AI · Superseded · Released Feb 1, 2026

Data verified July 20, 2026

Superseded: Moonshot AI has released a newer model in this line, Kimi K2.6.

Overall Score

59.66 · Public #54 of 200 · Verified #41 of 99

Arena Elo

1400

Eligible category ranks

6 of 8

Price (1M tokens)

$0.6 in / $3 out

API pricing

Speed

45 tok/s

Context

256K

Evidence coverage

63 of 321 tracked benchmarks are published. 31 are verified and 32 provisional. 8 of 8 categories are measured.

Updated July 20, 2026 · Methodology

Published / tracked: 63 / 321
Verified: 31
Provisional: 32
Categories with evidence: 8 / 8

Agentic · 19 benchmarks · Mixed evidence
Coding · 10 benchmarks · Mixed evidence
Reasoning · 3 benchmarks · Mixed evidence
Knowledge · 12 benchmarks · Mixed evidence
Math · 9 benchmarks · Mixed evidence
Multilingual · 2 benchmarks · Reported
Multimodal · 6 benchmarks · Mixed evidence
Inst. Following · 2 benchmarks · Reported

Open Weight · Self-host · Non-Reasoning

Confidence: Very high

base

Kimi K2.5 ranks #54 out of 200 models on the public leaderboard with an overall score of 59.66/100. It also ranks #41 out of 99 on the verified leaderboard. It is not a frontier model, but it places well in instruction following (#8 of 38) and coding (#31 of 122).

Kimi K2.5 is an open-weight model with a 256K token context window. It processes queries without explicit chain-of-thought reasoning, which generally means faster responses and lower token usage than a reasoning variant.

Kimi K2.5 sits inside the Kimi K2.5 family alongside Kimi K2.5 (Reasoning). This profile currently has 63 of 321 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

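Its best absolute rank is in Mathematics (#5 of 7), but that cohort is small and the rank corresponds to only the 33rd percentile. By percentile, its strongest ranked category is Instruction Following (#8 of 38, 81st percentile), followed by Coding (#31 of 122, 75th percentile); its weakest is Agentic (#95 of 119, 20th percentile). Its raw results on competition math are high (for example, 95.8% on AIME 2026), which suits mathematical reasoning and quantitative analysis, while multi-step tool use is the area where other models are clearly stronger.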

Peer position

Exact provisional scores and ranks for the closest listed peers. A score can appear before a model clears the evidence threshold for a rank, so equal scores can have different rank states.

Range 59.35–59.97

Model	Developer	Rank	Score
Grok 4.1	xAI	#51	59.97
GLM-5 (Reasoning)	Z.AI	#52	59.77
Qwen 3.6 Max (preview)	Alibaba	#53	59.72
Kimi K2.5 (current model)	Moonshot AI	#54	59.66
MiniMax M2.5	MiniMax	#55	59.52
Qwen3.5 397B (Reasoning)	Alibaba	#56	59.5
Kimi K2.5 (Reasoning)	Moonshot AI	#57	59.35

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Category	Percentile	Eligible cohort rank	Category score
Math	33%	#5 of 7	62.8
Inst. Following	81%	#8 of 38	90.3
Multilingual	42%	#8 of 13	38.2
Coding	75%	#31 of 122	57.0
Knowledge	22%	#41 of 52	58.6
Agentic	20%	#95 of 119	39.6
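
The six percentiles listed above are consistent with a simple rank-to-percentile conversion, (N - rank) / (N - 1) × 100, rounded to the nearest integer. The page does not document this formula, so the sketch below is an inference from the listed rows rather than BenchLM's stated method, and the function name `rank_percentile` is hypothetical.

```python
def rank_percentile(rank: int, cohort_size: int) -> int:
    """Percentile position for a 1-based rank; 100 is best, 0 is last.

    Assumed formula (inferred from the listed rows, not documented):
    (cohort_size - rank) / (cohort_size - 1) * 100, rounded.
    """
    if cohort_size < 2:
        raise ValueError("need at least two ranked models")
    return round(100 * (cohort_size - rank) / (cohort_size - 1))


# (rank, cohort size) pairs copied from the category percentile list.
rows = {
    "Math": (5, 7),
    "Inst. Following": (8, 38),
    "Multilingual": (8, 13),
    "Coding": (31, 122),
    "Knowledge": (41, 52),
    "Agentic": (95, 119),
}

percentiles = {name: rank_percentile(r, n) for name, (r, n) in rows.items()}
for name, pct in percentiles.items():
    print(f"{name}: {pct}%")
```

Under this reading, a small cohort moves the percentile sharply: rank #5 of 7 and rank #8 of 38 are similar absolute ranks but land at very different percentiles.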

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	39.6	#95 of 119	20th	22%	19 benchmarks	Mixed sources
Coding	57.0	#31 of 122	75th	20%	10 benchmarks	Mixed sources
Reasoning	79.6	Not ranked	Not available	17%	3 benchmarks	Mixed sources
Knowledge	58.6	#41 of 52	22nd	12%	12 benchmarks	Mixed sources
Math	62.8	#5 of 7	33rd	5%	9 benchmarks	Mixed sources
Multilingual	38.2	#8 of 13	42nd	7%	2 benchmarks	Reported
Multimodal	65.9	Not ranked	Not available	12%	6 benchmarks	Mixed sources
Inst. Following	90.3	#8 of 38	81st	5%	2 benchmarks	Reported
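
As a sanity check on the weights in the table, the sketch below computes a plain weighted average of the eight category scores. This is only an illustration: the page does not state how the overall score is aggregated, and the plain average comes out near 58.91 rather than the published 59.66, so the published figure evidently involves something beyond this calculation.

```python
# (category score, weight in percent) copied from the category evidence table.
categories = {
    "Agentic": (39.6, 22),
    "Coding": (57.0, 20),
    "Reasoning": (79.6, 17),
    "Knowledge": (58.6, 12),
    "Math": (62.8, 5),
    "Multilingual": (38.2, 7),
    "Multimodal": (65.9, 12),
    "Inst. Following": (90.3, 5),
}

total_weight = sum(weight for _, weight in categories.values())

# Plain weighted mean; an assumption, not BenchLM's documented method.
weighted_mean = (
    sum(score * weight for score, weight in categories.values()) / total_weight
)

print(f"total weight: {total_weight}%")
print(f"plain weighted mean: {weighted_mean:.2f}")
```

The weights sum to 100%, and Agentic and Coding together carry 42% of it, which is why the low Agentic score weighs so heavily on the overall result.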

Self-host vs API cost

Estimates at 50,000 req/day · 1,000 tokens/req average.

Kimi K2.5

API / mo: $2,700
Self-host / mo: $5,221
Break-even: 132M/day
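
The $2,700 monthly API figure can be reproduced from the listed prices under two assumptions that the page does not state: a 30-day month and an even split between input and output tokens. The sketch below makes both assumptions explicit; the function and constant names are hypothetical.

```python
REQUESTS_PER_DAY = 50_000      # from the estimate line
TOKENS_PER_REQUEST = 1_000     # from the estimate line
PRICE_IN_PER_M = 0.60          # USD per 1M input tokens (listed price)
PRICE_OUT_PER_M = 3.00         # USD per 1M output tokens (listed price)
DAYS_PER_MONTH = 30            # ASSUMPTION: not stated on the page
INPUT_SHARE = 0.5              # ASSUMPTION: half of all tokens are input


def monthly_api_cost(
    requests_per_day: int = REQUESTS_PER_DAY,
    tokens_per_request: int = TOKENS_PER_REQUEST,
    input_share: float = INPUT_SHARE,
    days: int = DAYS_PER_MONTH,
) -> float:
    """Estimated monthly API spend in USD under the assumptions above."""
    millions_of_tokens = requests_per_day * tokens_per_request * days / 1_000_000
    input_m = millions_of_tokens * input_share
    output_m = millions_of_tokens - input_m
    return input_m * PRICE_IN_PER_M + output_m * PRICE_OUT_PER_M


cost = monthly_api_cost()
print(f"estimated API cost per month: ${cost:,.0f}")
```

Because output tokens cost five times as much as input tokens, the result is sensitive to `input_share`; a workload dominated by long prompts and short answers would come in well below this estimate.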


Chatbot Arena performance


Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1400	Not available	Not available

Benchmark Details

Rows below have a displayable published verification record. Each source link and provenance note remains in the page HTML while its category is closed. Source-unverified manual rows and generated rows stay hidden.

Agentic (19 benchmarks)

Terminal-Bench 2.0 (Provider exact)
50.8% · Weighted 38%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

BrowseComp (Provider exact)
60.6% · Weighted 28%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

Claw-Eval (Benchmark exact)
52.3% · Display only
Source: Claw-Eval leaderboard
Provenance: Claw-Eval reports this model as kimi_k25 in the official 2026-05-09 leaderboard snapshot. BenchLM stores the primary Pass^3 value on the local Claw-Eval display key.

QwenClawBench (Secondary exact)
54.3% · Display only
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

τ³-bench results (Secondary exact)
τ³-Bench Tool-Agent-User Evaluation
65.7% · Display only
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

DeepSearchQA (Provider exact)
77.1% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports DeepSearchQA at 77.1 in the agentic search evaluation table.

DeepPlanning (Secondary exact)
14.4% · Display only
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

Toolathlon (Provider exact)
27.8% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports Toolathlon at 27.8 in the agentic search evaluation table.

MCP Atlas (Provider exact)
29.5% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports MCPMark at 29.5 in the agentic search evaluation table. BenchLM maps this to the MCP Atlas display key.

MCP-Tasks (Secondary exact)
59.1% · Display only
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

WideResearch (Provider exact)
72.7% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports WideSearch at 72.7 item-f1 in the agentic search evaluation table. BenchLM maps this to the Wide Research display key.

τ²-bench results (Reported)
τ²-Bench Tool-Agent-User Evaluation
95.9% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

APEX-Agents-AA (Reported)
11.5% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Gert Labs (Benchmark exact)
Gert Labs Composite Game Benchmark
45.88% · Display only
Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

ResearchClawBench (Benchmark exact)
14.0% · Display only
Source: ResearchClawBench leaderboard
Provenance: ResearchClawBench reports this model as ResearchHarness (Kimi-K2.5) in the official Pass@1 leaderboard. BenchLM stores the one-decimal RADS average on the local ResearchClawBench display key and excludes it from weighted rankings.

JobBench (Benchmark exact)
8.7% · Display only
Source: JobBench paper
Provenance: JobBench reports Kimi-K2.5 under OpenCode on the main-set leaderboard. BenchLM stores the reported main-set score.

AA Agentic Index (Reported)
Artificial Analysis Agentic Index
21.7% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

GDPval-AA (Reported)
GDPval-AA normalized
25.4% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

GDPval-AA (Reported)
1009 · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Coding (10 benchmarks)

SWE-Rebench (Benchmark exact)
58.5% · Weighted 20%
Source: SWE-Rebench leaderboard
Provenance: Public SWE-Rebench leaderboard lists Kimi K2.5 at 58.5% resolved rate.

SWE-bench Verified (Provider exact)
Software Engineering Benchmark Verified
76.8% · Weighted 16%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

SciCode (Provider exact)
Scientific Code Benchmark
48.7% · Weighted 16%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports SciCode at 48.7 in the coding evaluation table.

SWE-bench Pro (Provider exact)
50.7% · Weighted 10%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports SWE-Bench Pro at 50.7 in the coding evaluation table.

SWE-bench Verified* (Secondary exact)
SWE-bench Verified (mini-swe-agent-v2)
70.8% · Display only
Source: Arcee Trinity-Large-Thinking comparison table
Provenance: Secondary exact

LiveCodeBench v6 (Provider exact)
85.0% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports LiveCodeBench (v6) at 85.0 in the coding evaluation table.

SWE Multilingual (Provider exact)
73% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports SWE-Bench Multilingual at 73.0 in the coding evaluation table.

React Native Evals (Benchmark exact)
77.2% · Display only
Source: React Native Evals leaderboard
Provenance: React Native Evals reports this exact overall score for Kimi K2.5 in the public dashboard run finished on 2026-04-28.

AA-SciCode (Reported)
Artificial Analysis SciCode
49.0% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA Coding Index (Reported)
Artificial Analysis Coding Index
46.8% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Reasoning (3 benchmarks)

LongBench v2 (Provider exact)
61% · Weighted 38%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports Longbench v2 at 61.0 in the long-context benchmark table.

AA-LCR (Reported)
Artificial Analysis Long Context Reasoning
65.3% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

CritPt (Reported)
Critical Physics Tasks
3.1% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Knowledge (12 benchmarks)

HLE (Provider exact)
Humanity's Last Exam
30.1% · Weighted 45%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports HLE-Full at 30.1 for Kimi K2.5 in the reasoning and knowledge evaluation table.

MMLU-Pro (Provider exact)
Massive Multitask Language Understanding Professional
87.1% · Weighted 30%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

GPQA (Provider exact)
Graduate-Level Google-Proof Q&A
87.6% · Weighted 7%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

SuperGPQA (Secondary exact)
SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines
69.2% · Weighted 7%
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

GPQA-D (Provider exact)
GPQA Diamond
87.6% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports GPQA-Diamond at 87.6 in the reasoning and knowledge evaluation table.

MMLU-Pro (Arcee) (Secondary exact)
MMLU-Pro first-party comparison snapshot
87.1% · Display only
Source: Arcee Trinity-Large-Thinking comparison table
Provenance: Secondary exact

Artificial Analysis Intelligence Index (Reported)
35.4% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-GPQA Diamond (Reported)
Artificial Analysis GPQA Diamond
87.9% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-HLE (Reported)
Artificial Analysis Humanity's Last Exam
29.4% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Index (Reported)
Artificial Analysis Omniscience Index
-8.1% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Accuracy (Reported)
Artificial Analysis Omniscience Accuracy
34.3% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Hallucination Rate (Reported)
Artificial Analysis Omniscience Hallucination Rate
64.6% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Math (9 benchmarks)

FrontierMath v2 (Tiers 1-3) (Benchmark exact)
FrontierMath v2 Tiers 1-3
27.9% · Weighted 30%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tiers 1-3 at 27.9% for fireworks/kimi-k2p5. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

AIME26 (Provider exact)
AIME 2026
95.8% · Weighted 25%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports AIME 2026 at 95.8 in the reasoning and knowledge evaluation table.

HMMT Feb 2026 (Provider exact)
Harvard-MIT Mathematics Tournament February 2026
87.1% · Weighted 25%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI publishes exact HMMT-family comparison rows for Kimi K2.5 on the model card benchmark table.

FrontierMath v2 (Tier 4) (Benchmark exact)
FrontierMath v2 Tier 4
4.2% · Weighted 10%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tier 4 at 4.2% for fireworks/kimi-k2p5. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

AIME 2025 (Provider exact)
American Invitational Mathematics Examination 2025
96.1% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports AIME 2025 at 96.1 in the reasoning and knowledge evaluation table.

AIME25 (Arcee) (Secondary exact)
AIME25 first-party comparison snapshot
96.3% · Display only
Source: Arcee Trinity-Large-Thinking comparison table
Provenance: Secondary exact

HMMT Feb 2025 (Provider exact)
Harvard-MIT Mathematics Tournament February 2025
95.4% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports HMMT 2025 (Feb) at 95.4 in the reasoning and knowledge evaluation table.

HMMT Nov 2025 (Provider exact)
Harvard-MIT Mathematics Tournament November 2025
91.1% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI publishes exact HMMT-family rows for Kimi K2.5 and the stored hmmtNov2025 row matches the public figure.

MMAnswerBench (Provider exact)
81.8% · Display only
Source: MoonshotAI: Kimi K2.5 model card
Provenance: MoonshotAI reports IMO-AnswerBench at 81.8 in the reasoning and knowledge evaluation table.

Multilingual (2 benchmarks)

MMLU-ProX (Secondary exact)
82.3% · Weighted 100%
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

NOVA-63 (Secondary exact)
56.0% · Display only
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

Multimodal (6 benchmarks)

MMMU-Pro (Provider exact)
Massive Multi-discipline Multimodal Understanding Pro
78.5% · Weighted 45%
Source: MoonshotAI: Kimi K2.5 model card
Provenance: Provider exact

Video-MME (Reported)
87.4% · Display only
Source: Reported upstream source
Provenance: Reported row carried from an upstream public source. Displayable on BenchLM, but not treated as verified unless explicitly marked otherwise.

MMVU (Reported)
Multimodal Multi-disciplinary Video Understanding
80.4% · Display only
Source: Reported upstream source
Provenance: Reported row carried from an upstream public source. Displayable on BenchLM, but not treated as verified unless explicitly marked otherwise.

VideoMMMU (Secondary exact)
86.6% · Display only
Source: Qwen3.6-Plus multimodal comparison table
Provenance: Secondary exact

AA-MMMU-Pro (Reported)
Artificial Analysis MMMU-Pro
75.4% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Design Arena Website (Reported)
Design Arena Website Elo
1282 · Display only
Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Inst. Following (2 benchmarks)

IFEval (Secondary exact)
Instruction-Following Eval
93.9% · Weighted 35%
Source: Qwen3.6-Plus comparison table
Provenance: Secondary exact

AA-IFBench (Reported)
Artificial Analysis IFBench
70.2% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Kimi K2.5 Family

Base entry

Kimi K2.5 (Reasoning) · Score 59.35

Frequently Asked Questions

How does Kimi K2.5 perform overall in AI benchmarks?

Kimi K2.5 currently ranks #54 out of 200 models on BenchLM's provisional leaderboard with an overall score of 59.66. It also ranks #41 out of 99 on the verified leaderboard. It is created by Moonshot AI. Its published context window is 256K.

Is Kimi K2.5 good for knowledge and understanding?

Kimi K2.5 ranks #41 out of 52 models in knowledge and understanding benchmarks with an average score of 58.6. There are stronger options in this category.

Is Kimi K2.5 good for coding and programming?

Kimi K2.5 ranks #31 out of 122 models in coding and programming benchmarks with an average score of 57.0. That places it at the 75th percentile, though there are stronger options in this category.

Is Kimi K2.5 good for mathematics?

Kimi K2.5 ranks #5 out of 7 models in mathematics benchmarks with an average score of 62.8. That is the 33rd percentile of a small ranked cohort, so there are stronger options in this category.

Is Kimi K2.5 good for reasoning and logic?

Kimi K2.5 has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Kimi K2.5 good for agentic tool use and computer tasks?

Kimi K2.5 ranks #95 out of 119 models in agentic tool use and computer tasks benchmarks with an average score of 39.6. There are stronger options in this category.

Is Kimi K2.5 good for multimodal and grounded tasks?

Kimi K2.5 has visible benchmark coverage in multimodal and grounded tasks, but BenchLM does not currently assign it a global category rank there.

Is Kimi K2.5 good for instruction following?

Kimi K2.5 ranks #8 out of 38 models in instruction following benchmarks with an average score of 90.3. It is among the top performers in this category.

Is Kimi K2.5 good for multilingual tasks?

Kimi K2.5 ranks #8 out of 13 models in multilingual benchmarks with an average score of 38.2. That is the 42nd percentile, so there are stronger options in this category.

Is Kimi K2.5 open source?

Kimi K2.5 is an open-weight model created by Moonshot AI, meaning its weights can be downloaded and run locally or fine-tuned for specific use cases. Open weight does not by itself mean that the training code or data are also released.

Which sibling models are related to Kimi K2.5?

Kimi K2.5 belongs to the Kimi K2.5 family. Related variants on BenchLM include Kimi K2.5 (Reasoning).

Does Kimi K2.5 have full benchmark coverage on BenchLM?

Not yet. Kimi K2.5 currently has 63 published benchmark scores out of the 321 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Kimi K2.5?

Kimi K2.5 has a published context window of 256K, which determines how much text it can process in a single interaction.


Last updated: July 20, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
