Model profile

Kimi K2.6

Name: Kimi K2.6
Author: Moonshot AI

Moonshot AI · Superseded · Released Apr 20, 2026

Data verified July 13, 2026

Superseded: Moonshot AI has released a newer model in this line: Kimi K2.7 Code

Kimi K2.6 by Moonshot AI scores 74/100 on BenchLM's provisional leaderboard (#20 of 79), with an Arena Elo of 1461 and a 256K-token context window. API pricing is $0.95/$4 per million input/output tokens. Newer replacements: Kimi K2.7 Code.

Overall Score

74 · Provisional #20 of 79 · Verified #13 of 35

Arena Elo

1461

Categories Ranked

4 of 8

Price (1M tokens)

$0.95 in / $4 out

API pricing

Speed

Not listed

Context

256K

Evidence coverage

60 of 296 tracked benchmarks are published. 33 are verified and 27 provisional. 7 of 8 categories are measured.

Updated July 13, 2026

Published / tracked: 60 / 296
Verified: 33
Provisional: 27
Categories measured: 7 / 8

Category	Benchmarks	Evidence
Agentic	22	Mixed evidence
Coding	13	Mixed evidence
Reasoning	2	Reported
Knowledge	10	Mixed evidence
Math	5	Verified
Multilingual	0	Not measured
Multimodal	7	Mixed evidence
Inst. Following	1	Reported
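The coverage figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers shown on this page; the variable names are illustrative and not part of any BenchLM API.

```python
# Per-category benchmark counts as listed on this profile.
category_counts = {
    "Agentic": 22, "Coding": 13, "Reasoning": 2, "Knowledge": 10,
    "Math": 5, "Multilingual": 0, "Multimodal": 7, "Inst. Following": 1,
}
published, tracked = 60, 296      # published vs. tracked benchmarks
verified, provisional = 33, 27    # evidence split of the published rows

total_rows = sum(category_counts.values())                  # rows across all categories
measured = sum(1 for n in category_counts.values() if n > 0)  # categories with any row
coverage_pct = 100 * published / tracked                    # share of tracked benchmarks

print(total_rows, measured, f"{coverage_pct:.1f}%")
```

The per-category counts add up to the 60 published rows, 7 of the 8 categories have at least one row, and the published share works out to roughly a fifth of the tracked benchmarks.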

Open Weight · Self-host · Reasoning

Confidence: High

base

According to BenchLM.ai, Kimi K2.6 ranks #20 out of 79 models on the provisional leaderboard with an overall score of 74/100. It also ranks #13 out of 35 on the verified leaderboard. This places it in roughly the top quarter of the provisional leaderboard, with strengths in specific benchmark categories.

Kimi K2.6 is an open-weight model with a 256K-token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

BenchLM links it directly to Kimi K2.5 as the earlier related model in that lineage. This profile currently has 60 of 296 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

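By category rank, its strongest category is Mathematics (#2 of 11 eligible models), while its weakest ranked category is Multimodal & Grounded (#29 of 110). By percentile, Coding is its highest category (95th, #6 of 101). This profile points to strength in mathematical reasoning, coding, and quantitative analysis, although the Mathematics cohort is small.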

Peer position

Exact provisional scores and ranks for the closest listed peers.

Range 74.0–75.0

Model	Provider	Rank	Score
GPT-5.2	OpenAI	#19	75.0
Kimi K2.6 (current model)	Moonshot AI	#20	74.0
DeepSeek V4 Pro (High)	DeepSeek	#21	74.0
GPT-5.5 Pro	OpenAI	Unranked	75.0
Holo3-35B-A3B	H Company	Unranked	74.0
MiMo-V2.5-Pro	Xiaomi	Unranked	74.0
SWE-1.7	Cognition	Unranked	74.0

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Category	Percentile	Eligible cohort rank	Category score
Math	90%	#2 of 11	71.8
Coding	95%	#6 of 101	88.1
Agentic	86%	#18 of 120	80.2
Multimodal	74%	#29 of 110	68.0
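The four percentiles listed above are consistent with a simple rank-based formula, percentile = (N - rank) / (N - 1), rounded to a whole number. That formula is an inference from the rows shown here, not BenchLM's documented method, and the function name below is hypothetical.

```python
def rank_percentile(rank: int, cohort_size: int) -> int:
    """Percentile from a 1-based rank: rank 1 maps to 100, last place to 0.

    Assumed formula (inferred from this page, not confirmed by BenchLM).
    """
    if cohort_size < 2:
        raise ValueError("need at least two models in the cohort")
    return round(100 * (cohort_size - rank) / (cohort_size - 1))


# (rank, cohort size) pairs as listed on this profile.
rows = {
    "Math": (2, 11),
    "Coding": (6, 101),
    "Agentic": (18, 120),
    "Multimodal": (29, 110),
}
for name, (rank, size) in rows.items():
    print(name, rank_percentile(rank, size))
```

Under this assumption, one rank step in the 11-model Math cohort is worth 10 percentile points, while one step in the 120-model Agentic cohort is worth less than 1, so percentiles from small cohorts are coarser.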

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	80.2	#18 of 120	86th	22%	22 benchmarks	Mixed sources
Coding	88.1	#6 of 101	95th	20%	13 benchmarks	Mixed sources
Reasoning	0.0	Not ranked	Not available	17%	2 benchmarks	Reported
Knowledge	63.4	Not ranked	Not available	12%	10 benchmarks	Mixed sources
Math	71.8	#2 of 11	90th	5%	5 benchmarks	Verified
Multilingual	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
Multimodal	68.0	#29 of 110	74th	12%	7 benchmarks	Mixed sources
Inst. Following	0.0	Not ranked	Not available	5%	1 benchmark	Reported

Self-host vs API cost

Estimates at 50,000 req/day · 1,000 tokens/req average.

Kimi K2.6

API / mo: $3,713

Self-host / mo: $18,221

Break-even: 326M/day
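The API figure can be reproduced from the listed prices under two assumptions that this page does not state: an even split between input and output tokens, and a 30-day month. The sketch below is illustrative only; the function and parameter names are hypothetical, and it does not attempt to model the self-host cost or the break-even volume, whose inputs are not given here.

```python
def monthly_api_cost(requests_per_day, tokens_per_request,
                     input_price_per_m, output_price_per_m,
                     input_share=0.5, days_per_month=30):
    """Estimate monthly API spend in dollars.

    input_share and days_per_month are assumptions, not values from the page.
    Prices are dollars per one million tokens.
    """
    tokens_per_month = requests_per_day * tokens_per_request * days_per_month
    input_millions = tokens_per_month * input_share / 1_000_000
    output_millions = tokens_per_month * (1 - input_share) / 1_000_000
    return input_millions * input_price_per_m + output_millions * output_price_per_m


# 50,000 req/day at 1,000 tokens/req, priced at $0.95 in / $4 out per 1M tokens.
cost = monthly_api_cost(50_000, 1_000, 0.95, 4.00)
print(f"${cost:,.2f}")
```

With these assumptions the estimate comes to about $3,712.50, which matches the listed $3,713 after rounding. A workload that is heavier on input tokens would cost less, since input is priced at roughly a quarter of output.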


Chatbot Arena performance


Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1461	±4.7	32,657
Coding	1512	±7.5	8,944
Math	1485	±14.3	1,718
Instruction Following	1457	±6.8	10,517
Creative Writing	1432	±9.1	5,061
Multi-turn	1458	±8.7	5,687
Hard Prompts	1487	±5.5	21,037
Hard Prompts (English)	1487	±7.0	10,286
Longer Query	1477	±6.6	13,360
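To read an Elo gap as a head-to-head expectation, the standard Elo logistic curve is a common approximation. The sketch below assumes that curve (base 10, scale 400); the leaderboard's own rating fit may differ in detail, and the opponent rating of 1500 is a made-up value for illustration, not a model from this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win share of A against B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


kimi_text_overall = 1461       # Text Overall Elo from the table above
hypothetical_opponent = 1500   # assumption: illustrative rating only

share = expected_score(kimi_text_overall, hypothetical_opponent)
print(f"{share:.3f}")
```

Equal ratings give an expected share of 0.5, and a 400-point deficit gives about 1 in 11. The confidence intervals in the table matter for close comparisons: a gap smaller than the combined interval widths should not be read as a reliable difference.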

Benchmark Details

Rows below have a displayable published verification record, each listed with its source and provenance note. Source-unverified manual rows and generated rows stay hidden.

Agentic (22 benchmarks)

Terminal-Bench 2.0 · Provider exact
66.7% · Weighted 38%
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

OSWorld-Verified · Provider exact
73.1% · Weighted 34%
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

BrowseComp · Provider exact
83.2% · Weighted 28%
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

Toolathlon · Provider exact
50% · Display only
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

MCP Atlas · Provider exact
55.9% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: MoonshotAI reports MCPMark at 55.9 on the Kimi K2.6 launch table. BenchLM maps this to the MCP Atlas display key.

Claw-Eval · Benchmark exact
62.3% · Display only
Source: Claw-Eval leaderboard
Provenance: Claw-Eval reports this model as kimi_k26 in the official 2026-05-09 leaderboard snapshot. BenchLM stores the primary Pass^3 value on the local Claw-Eval display key.

DeepSearchQA · Provider exact
92.5% · Display only
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: MoonshotAI highlights DeepSearchQA as an f1-score benchmark in the Kimi K2.6 tech blog and reports 92.5 on the benchmark table.

WideResearch · Provider exact
80.8% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: MoonshotAI reports WideSearch at 80.8 item-f1 in the agentic evaluation table. BenchLM maps this to the Wide Research display key.

AA Agentic Index (Artificial Analysis Agentic Index) · Reported
30.3% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Tau2-Telecom · Reported
95.9% · Display only
Source: Artificial Analysis: tau2-bench leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

GDPval-AA (GDPval-AA normalized) · Reported
34.5% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

GDPval-AA · Reported
1190 · Display only
Source: Artificial Analysis: gdpval-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

APEX-Agents-AA · Reported
28.5% · Display only
Source: Artificial Analysis: apex-agents-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Gert Labs (Gert Labs Composite Game Benchmark) · Benchmark exact
56.82% · Display only
Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

ResearchClawBench · Benchmark exact
18.0% · Display only
Source: ResearchClawBench leaderboard
Provenance: ResearchClawBench reports this model as ResearchHarness (Kimi-K2.6) in the official Pass@1 leaderboard. BenchLM stores the one-decimal RADS average on the local ResearchClawBench display key and excludes it from weighted rankings.

OSWorld 2.0 · Benchmark exact
4.6% · Display only
Source: OSWorld 2.0 paper
Provenance: OSWorld 2.0 reports Kimi 2.6 single action on its 500-step main table. BenchLM stores the binary completion score and notes the corresponding partial score was 22.1%.

AA Briefcase (Artificial Analysis Briefcase) · Reported
809 · Display only
Source: Artificial Analysis: aa-briefcase leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA AutomationBench (Artificial Analysis AutomationBench) · Reported
19.6% · Display only
Source: Artificial Analysis: automationbench-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA EnterpriseOps-Gym (Artificial Analysis EnterpriseOps-Gym) · Reported
38.5% · Display only
Source: Artificial Analysis: enterprise-ops-gym-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA Harvey LAB (Artificial Analysis Harvey LAB-AA) · Reported
0.0% · Display only
Source: Artificial Analysis: harvey-lab-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA ITBench (Artificial Analysis ITBench-AA) · Reported
31.2% · Display only
Source: Artificial Analysis: itbench-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA Tau3 Banking (Artificial Analysis Tau3-Banking) · Reported
20.6% · Display only
Source: Artificial Analysis: tau3-banking leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Coding (13 benchmarks)

LiveCodeBench (LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code) · Provider exact
89.6% · Weighted 27%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: MoonshotAI reports LiveCodeBench (v6) at 89.6 in the coding evaluation table. BenchLM maps that exact value to both the core LiveCodeBench key and the display-only v6 slice.

SWE-bench Pro · Provider exact
58.6% · Weighted 27%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

SWE-bench Verified (Software Engineering Benchmark Verified) · Provider exact
80.2% · Weighted 16%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

SciCode (Scientific Code Benchmark) · Provider exact
52.2% · Weighted 10%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

LiveCodeBench v6 · Provider exact
89.6% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

SWE Multilingual · Provider exact
76.7% · Display only
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

Terminal-Bench 2.0 · Provider exact
66.7% · Display only
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: Provider exact

Vibe Code Bench (Vibe Code Bench v1.1) · Benchmark exact
37.89% · Display only
Source: Vals AI: Vibe Code Bench v1.1
Provenance: Vals Vibe Code Bench v1.1 reports this exact row under kimi/kimi-k2.6-thinking; BenchLM stores it on the local vibeCodeBench key.

cursorBench31 · Benchmark exact
47.6% · Display only
Source: Cursor evals: CursorBench 3.1
Provenance: Cursor reports Kimi 2.6 at this exact CursorBench 3.1 score on its public evals page. BenchLM stores it on the Kimi K2.6 row as a display-only coding-agent benchmark.

AA Coding Index (Artificial Analysis Coding Index) · Reported
61.8% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Terminal-Bench Hard · Reported
43.9% · Display only
Source: Artificial Analysis: terminalbench-hard leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-SciCode (Artificial Analysis SciCode) · Reported
53.5% · Display only
Source: Artificial Analysis: scicode leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA Terminal-Bench 2.1 (Artificial Analysis Terminal-Bench v2.1) · Reported
65.9% · Display only
Source: Artificial Analysis: terminalbench-v2-1 leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Reasoning (2 benchmarks)

AA-LCR (Artificial Analysis Long Context Reasoning) · Reported
69.7% · Display only
Source: Artificial Analysis: artificial-analysis-long-context-reasoning leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

CritPt (Critical Physics Tasks) · Reported
8.0% · Display only
Source: Artificial Analysis: critpt leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Knowledge (10 benchmarks)

HLE (Humanity's Last Exam) · Provider exact
34.7% · Weighted 32%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: MoonshotAI reports HLE-Full at 34.7 in the reasoning and knowledge evaluation table.

GPQA (Graduate-Level Google-Proof Q&A) · Provider exact
90.5% · Weighted 5%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: MoonshotAI reports GPQA-Diamond at 90.5 on the Kimi K2.6 launch table. BenchLM maps that exact value to the core GPQA key as well as the display-only GPQA-Diamond row.

GPQA-D (GPQA Diamond) · Provider exact
90.5% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

Artificial Analysis Intelligence Index · Reported
44.2% · Display only
Source: Artificial Analysis: artificial-analysis-intelligence-index leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-GPQA Diamond (Artificial Analysis GPQA Diamond) · Reported
91.1% · Display only
Source: Artificial Analysis: gpqa-diamond leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-HLE (Artificial Analysis Humanity's Last Exam) · Reported
35.9% · Display only
Source: Artificial Analysis: humanitys-last-exam leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-Omniscience Index (Artificial Analysis Omniscience Index) · Reported
6.4% · Display only
Source: Artificial Analysis: omniscience leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-Omniscience Accuracy (Artificial Analysis Omniscience Accuracy) · Reported
32.8% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Hallucination Rate (Artificial Analysis Omniscience Hallucination Rate) · Reported
39.3% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA Openness Index (Artificial Analysis Openness Index) · Reported
33.3% · Display only
Source: Artificial Analysis: artificial-analysis-openness-index leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Math (5 benchmarks)

FrontierMath v2 (Tiers 1-3) · Benchmark exact
38.966% · Weighted 30%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tiers 1-3 at 38.966% for kimi-k2.6. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

AIME26 (AIME 2026) · Provider exact
96.4% · Weighted 25%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

HMMT Feb 2026 (Harvard-MIT Mathematics Tournament February 2026) · Provider exact
92.7% · Weighted 25%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

FrontierMath v2 (Tier 4) · Benchmark exact
14.580% · Weighted 10%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tier 4 at 14.58% for kimi-k2.6. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

MMAnswerBench · Provider exact
86.0% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

Multimodal (7 benchmarks)

MMMU-Pro (Massive Multi-discipline Multimodal Understanding Pro) · Provider exact
79.4% · Weighted 45%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

CharXiv (CharXiv Reasoning) · Provider exact
80.4% · Weighted 25%
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

MMMU-Pro w/ Python (MMMU-Pro with Python) · Provider exact
80.1% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

MathVision · Provider exact
87.4% · Display only
Source: MoonshotAI: Kimi K2.6 model card
Provenance: Provider exact

V* · Provider exact
96.9% · Display only
Source: MoonshotAI: Kimi K2.6 tech blog
Provenance: MoonshotAI highlights V* with python at 96.9 on the Kimi K2.6 tech blog benchmark table.

AA-MMMU-Pro (Artificial Analysis MMMU-Pro) · Reported
79.4% · Display only
Source: Artificial Analysis: mmmu-pro leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Design Arena Website (Design Arena Website Elo) · Reported
1318 · Display only
Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Inst. Following (1 benchmark)

AA-IFBench (Artificial Analysis IFBench) · Reported
76.0% · Display only
Source: Artificial Analysis: ifbench leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Kimi K2.6 Family

Base entry

Related Earlier Model

Kimi K2.5

Frequently Asked Questions

How does Kimi K2.6 perform overall in AI benchmarks?

Kimi K2.6 currently ranks #20 out of 79 models on BenchLM's provisional leaderboard with an overall score of 74. It also ranks #13 out of 35 on the verified leaderboard. It is created by Moonshot AI and features a 256K context window.

Is Kimi K2.6 good for knowledge and understanding?

Kimi K2.6 has visible benchmark coverage in knowledge and understanding, but BenchLM does not currently assign it a global category rank there.

Is Kimi K2.6 good for coding and programming?

Kimi K2.6 ranks #6 out of 101 eligible models in coding and programming benchmarks (95th percentile) with a category score of 88.1. It is among the top performers in this category.

Is Kimi K2.6 good for mathematics?

Kimi K2.6 ranks #2 out of 11 eligible models in mathematics benchmarks (90th percentile) with a category score of 71.8. It is among the top performers in this category, though the ranked cohort is small.

Is Kimi K2.6 good for reasoning and logic?

Kimi K2.6 has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Kimi K2.6 good for agentic tool use and computer tasks?

Kimi K2.6 ranks #18 out of 120 eligible models in agentic tool use and computer tasks benchmarks (86th percentile) with a category score of 80.2. There are stronger options in this category.

Is Kimi K2.6 good for multimodal and grounded tasks?

Kimi K2.6 ranks #29 out of 110 eligible models in multimodal and grounded tasks benchmarks (74th percentile) with a category score of 68.0. There are stronger options in this category.

Is Kimi K2.6 good for instruction following?

Kimi K2.6 has visible benchmark coverage in instruction following, but BenchLM does not currently assign it a global category rank there.

Is Kimi K2.6 open source?

Kimi K2.6 is an open-weight model created by Moonshot AI: its weights can be downloaded and run locally or fine-tuned for specific use cases.

Does Kimi K2.6 have full benchmark coverage on BenchLM?

Not yet. Kimi K2.6 currently has 60 published benchmark scores out of the 296 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Kimi K2.6?

Kimi K2.6 has a context window of 256K tokens, which determines how much text it can process in a single interaction.


Last updated: July 13, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
