Which Chinese LLM is best in 2026? We rank Kimi K2.5, DeepSeek V3.2, Qwen3.5, GLM-5, MiMo, MiniMax M2.7, and more by benchmarks — coding, math, reasoning, and agentic tasks.
Chinese labs shipped more frontier-class models in the last six months than in all of 2024. Kimi K2.5 from Moonshot AI matches or beats several Western frontier models on coding benchmarks. GLM-5 from Zhipu AI posts near-perfect math scores. Alibaba's Qwen3.5 and DeepSeek's V3.2 keep pushing the open-weight frontier forward. And newer entrants — Xiaomi's MiMo, MiniMax M2.7, ByteDance Seed — are filling out the competitive landscape.
This guide focuses on Chinese text models with enough benchmark coverage to compare meaningfully, then calls out sparse-data outliers separately. It compares the field head-to-head against Western frontier models and breaks down which model wins for coding, math, reasoning, and agentic tasks. All scores come from the BenchLM.ai leaderboard — updated as new benchmarks are published.
| Rank | Model | Creator | Score | Type | Open Weight | Context |
|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | Moonshot AI | 67 | Reasoning | No | 128K |
| 1 | Qwen2.5-1M | Alibaba | 67 | Non-Reasoning | Yes | 1M |
| 3 | DeepSeek V3.2 (Thinking) | DeepSeek | 66 | Reasoning | Yes | 128K |
| 3 | DeepSeek Coder 2.0 | DeepSeek | 66 | Non-Reasoning | Yes | 128K |
| 5 | Qwen3.5 397B (Reasoning) | Alibaba | 63 | Reasoning | Yes | 128K |
| 6 | Qwen3.5 397B | Alibaba | 60 | Non-Reasoning | Yes | 128K |
| 7 | GLM-5 (Reasoning) | Zhipu AI | 59 | Reasoning | Yes | 200K |
| 8 | DeepSeek V3.2 | DeepSeek | 58 | Non-Reasoning | Yes | 128K |
| 8 | MiMo-V2-Flash | Xiaomi | 58 | Reasoning | Yes | 256K |
| 10 | MiniMax M2.7 | MiniMax | 57 | Non-Reasoning | No | 200K |
Scores are a normalized weighted average across 8 benchmark categories. See the full ranking at /best/chinese-models.
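For intuition, here is roughly how a normalized weighted category average like this behaves. Everything below is an illustrative assumption: the category names, the weights, and the missing-category handling are guesses, not BenchLM.ai's published methodology.

```python
# Illustrative sketch of a normalized weighted category average.
# Category names and weights are assumptions, not BenchLM.ai's actual method.
CATEGORY_WEIGHTS = {
    "coding": 0.15, "math": 0.15, "reasoning": 0.15, "agentic": 0.15,
    "knowledge": 0.10, "instruction_following": 0.10,
    "long_context": 0.10, "multimodal": 0.10,
}  # 8 categories, weights sum to 1.0

def overall_score(category_scores: dict[str, float | None]) -> float:
    """Weighted average over the categories a model actually has scores for,
    re-normalizing weights so an unbenchmarked category doesn't drag the score down."""
    scored = {c: s for c, s in category_scores.items() if s is not None}
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in scored)
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in scored.items()) / total_weight

# A model strong on math and coding but weaker elsewhere lands in the high 60s:
print(round(overall_score({
    "coding": 75, "math": 90, "reasoning": 70, "agentic": 55,
    "knowledge": 65, "instruction_following": 55,
    "long_context": 60, "multimodal": None,  # not benchmarked
}), 1))  # 68.3
```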
Two things stand out. First, eight of the top 10 are open weight, meaning you can download the weights and self-host. Second, seven different labs appear once you extend the list to the 15 broadly benchmarked text models. The Chinese AI ecosystem is not a one-company story.
Among broadly benchmarked Chinese LLMs, the best score is 67 overall. For context, here's how that stacks up:
| Model | Creator | Score | Arena Elo |
|---|---|---|---|
| Gemini 3.1 Pro | Google | 83 | — |
| GPT-5.4 | OpenAI | 80 | — |
| Claude Opus 4.6 | Anthropic | 76 | — |
| Claude Sonnet 4.6 | Anthropic | 76 | — |
| Kimi K2.5 (Reasoning) | Moonshot AI | 67 | 1447 |
| Qwen2.5-1M | Alibaba | 67 | 1256 |
| DeepSeek V3.2 (Thinking) | DeepSeek | 66 | 1421 |
| Qwen3.5 397B (Reasoning) | Alibaba | 63 | 1450 |
The overall gap is real: the best Chinese score of 67 trails Gemini 3.1 Pro by 16 points, GPT-5.4 by 13, and the two Claude 4.6 models by 9. But overall scores hide category-level strengths. On math benchmarks, GLM-5 (Reasoning) outscores every model on the leaderboard. On coding, Kimi K2.5 is competitive with Claude and GPT. The gap is widest on multimodal and instruction-following tasks.
On Chatbot Arena, Chinese models tell a different story. GLM-5 (Reasoning) sits at Elo 1451 and Qwen3.5 397B (Reasoning) at 1450 — competitive with the top Western models in human preference rankings. The disconnect between Arena Elo and benchmark scores suggests Chinese models may be stronger on conversational tasks than standardized benchmarks capture.
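Elo differences translate directly into expected head-to-head win rates via the standard Elo formula, which is worth keeping in mind when reading small gaps. A quick sketch, using the Elos cited above (the formula is the standard one, not anything BenchLM-specific):

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score: probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# GLM-5 (Reasoning, 1451) vs DeepSeek V3.2 (Thinking, 1421):
print(f"{expected_win_rate(1451, 1421):.1%}")  # ~54.3%: a 30-point Elo gap is nearly a coin flip
```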
| Model | SWE-bench Verified | SWE-bench Pro | LiveCodeBench |
|---|---|---|---|
| Kimi K2.5 (Reasoning) | 76.8 | 70 | 55 |
| Kimi K2.5 | 76.8 | 40 | 55 |
| MiMo-V2-Flash | 73.4 | 52 | — |
| DeepSeek Coder 2.0 | 65 | 50 | 40 |
| GLM-5 (Reasoning) | 62 | 67 | 49 |
| Qwen3.5 397B (Reasoning) | 60 | 65 | 50 |
| MiniMax M2.7 | — | 56.22 | — |
Kimi K2.5 is the clear coding leader. SWE-bench Verified 76.8 puts it in the same tier as GPT-5.4 Pro and Claude Opus 4.6 — remarkable for a Chinese model that most Western developers haven't heard of. Moonshot AI has invested heavily in code-specific training, and it shows.
MiMo-V2-Flash from Xiaomi is the surprise here. At SWE-bench Verified 73.4 as an open-weight model with 256K context, it's a strong option for teams that need to self-host a coding assistant.
MiniMax M2.7 has limited benchmark coverage (only 11 benchmarks total) but posts a solid SWE-bench Pro 56.22 at aggressive pricing — useful for budget coding workloads.
| Model | AIME 2025 | HMMT 2025 | MATH 500 |
|---|---|---|---|
| GLM-5 (Reasoning) | 98 | 95 | 92 |
| Kimi K2.5 (Reasoning) | 96.1 | 95.4 | 92 |
| MiMo-V2-Flash | 94.1 | 76 | 90 |
| Qwen3.5 397B (Reasoning) | 94 | 90 | 93 |
| DeepSeek V3.2 (Thinking) | 88 | 84 | 84 |
| Qwen2.5-1M | 86 | 82 | 83 |
GLM-5 (Reasoning) from Zhipu AI posts AIME 2025 at 98 and HMMT 2025 at 95 — among the highest math scores on the entire BenchLM.ai leaderboard, including Western models. Chinese labs have consistently pushed math capability, and GLM-5 represents the current ceiling.
Kimi K2.5 (Reasoning) is close behind at AIME 96.1 and HMMT 95.4. The math race between Zhipu and Moonshot is tight.
MiMo-V2-Flash posts an interesting split: AIME 2025 at 94.1 (strong) but HMMT 2025 at only 76, roughly 19 points behind the leaders. This suggests MiMo may be specifically optimized for AIME-style problems.
| Model | Terminal-Bench 2.0 | BrowseComp | OSWorld-Verified |
|---|---|---|---|
| GLM-5 (Reasoning) | 81 | 80 | 74 |
| Qwen3.5 397B (Reasoning) | 77 | 78 | 70 |
| DeepSeek V3.2 (Thinking) | 71 | 70 | 67 |
| Qwen2.5-1M | 65 | 72 | 59 |
| MiMo-V2-Flash | 63 | 65 | 58 |
| MiniMax M2.7 | 57 | — | — |
| Kimi K2.5 (Reasoning) | 50.8 | 60.6 | 63.3 |
Among broadly benchmarked Chinese text models, GLM-5 (Reasoning) dominates agentic benchmarks with Terminal-Bench 81, BrowseComp 80, and OSWorld 74. These are globally competitive scores — GPT-5.4 scores 85 on OSWorld, meaning GLM-5 is within 11 points of the absolute frontier.
An interesting contrast: Kimi K2.5 leads in coding but trails in agentic tasks (Terminal-Bench 50.8 vs GLM-5's 81). This reflects different model design priorities — Kimi is optimized for code generation while GLM-5 is built for broader tool use and computer interaction.
The biggest differentiator for Chinese LLMs isn't raw scores — it's access. Here's the open-weight landscape:
| Model | Creator | Score | Weights Available |
|---|---|---|---|
| Qwen2.5-1M | Alibaba | 67 | Yes |
| DeepSeek V3.2 (Thinking) | DeepSeek | 66 | Yes |
| DeepSeek Coder 2.0 | DeepSeek | 66 | Yes |
| Qwen3.5 397B (Reasoning) | Alibaba | 63 | Yes |
| Qwen3.5 397B | Alibaba | 60 | Yes |
| GLM-5 (Reasoning) | Zhipu AI | 59 | Yes |
| DeepSeek V3.2 | DeepSeek | 58 | Yes |
| MiMo-V2-Flash | Xiaomi | 58 | Yes |
| Kimi K2.5 | Moonshot AI | 56 | Yes |
Nine of the top 11 broadly benchmarked Chinese text models are open weight. For comparison, none of GPT-5.4, Claude Opus 4.6, or Gemini 3.1 Pro offer downloadable weights. If your use case requires self-hosting, fine-tuning, or full control over the inference stack, Chinese open-weight models are the strongest available option.
DeepSeek V3.2 (Thinking) at score 66 is the highest-scoring open-weight reasoning model from any lab. Qwen2.5-1M at 67 is the highest-scoring open-weight non-reasoning model with a 1M context window — no Western model matches both the score and context length in open weight form.
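If you want to try self-hosting, the sketch below shows the general shape using vLLM. The model identifier is a placeholder, not a confirmed repo name; check each lab's Hugging Face page for the actual repo id and hardware requirements, which are substantial at this scale.

```python
# Minimal vLLM serving sketch for an open-weight model.
# The repo id below is a placeholder -- substitute the actual Hugging Face
# repo for the model you choose (and expect to need multiple GPUs at this scale).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2",   # placeholder repo id
    tensor_parallel_size=8,              # shard weights across 8 GPUs
    max_model_len=131072,                # 128K context
)
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain the CAP theorem in two sentences."], params)
print(outputs[0].outputs[0].text)
```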
Kimi K2.5 is Moonshot AI's flagship. The reasoning variant scores 67 overall, tied for the highest Chinese model score. Kimi's strength is coding: SWE-bench Verified 76.8 is elite by any standard. The base Kimi K2.5 (open weight, score 56) matches the reasoning variant on SWE-bench Verified and LiveCodeBench but falls to 40 on SWE-bench Pro and drops on reasoning and math. Moonshot also maintains the older Kimi K2 (score 26) and Moonshot v1 (score 44).
DeepSeek has the broadest lineup. V3.2 (Thinking) at 66 and the base V3.2 at 58 are the latest. DeepSeek Coder 2.0 at 66 targets code-heavy workflows. The older DeepSeek-R1 (44) pioneered open-weight reasoning but has been eclipsed by V3.2. DeepSeek V3 (54) and V3.1 (33) remain available. All DeepSeek models are open weight.
Alibaba covers two product lines. Qwen2.5-1M (67) is the long-context specialist — 1M tokens at high quality. Qwen3.5 397B (60/63 with reasoning) is the large parameter model. The older Qwen3 235B (45/52) and Qwen2.5-72B (51) round out the lineup. All open weight.
Zhipu AI's GLM-5 (Reasoning) at 59 is the math and agentic champion: AIME 98, Terminal-Bench 81. The non-reasoning GLM-5 scores 49. GLM-4.7 (51) and GLM-4.7-Flash (47) are smaller, faster alternatives. GLM-5 is open weight; the older GLM-4.5 and GLM-4.5-Air are proprietary.
Xiaomi is a newcomer to frontier AI. MiMo-V2-Flash (58, open weight) is the highlight: strong math (AIME 94.1) and coding (SWE-bench Verified 73.4) in a 256K-context model. MiMo-V2-Pro scores 84 overall but with only 3 benchmarks, too sparse to rank reliably. MiMo-V2-Omni (76, 2 benchmarks) is similarly data-limited.
MiniMax M2.7 (57) focuses on coding at aggressive pricing. Only 11 benchmarks published, but SWE-bench Pro 56.22 is solid. MiniMax M2.5 (44) is the older model with fuller coverage.
ByteDance's Seed models cluster in the 40–49 range. Seed 1.6 (49) and Seed-2.0-Lite (47) are the best performers. All are proprietary with 256K context. ByteDance has not pushed into the frontier tier that DeepSeek and Alibaba occupy.
Self-hosted coding assistant — Kimi K2.5 (open weight, SWE-bench 76.8) or MiMo-V2-Flash (open weight, SWE-bench 73.4, 256K context). Both are strong enough for production code review and generation.
Math and science — GLM-5 (Reasoning). AIME 98 and GPQA 94 make it the top choice for math-heavy and science-heavy workloads from any Chinese lab.
Long-context processing — Qwen2.5-1M. 1M context at score 67, open weight. No other Chinese model combines this context length with this level of quality.
Budget coding API — MiniMax M2.7. Limited benchmarks but strong coding scores at very competitive pricing for teams that don't need to self-host.
Best all-rounder — Kimi K2.5 (Reasoning) at score 67. The strongest overall Chinese model with broad benchmark coverage across coding, math, reasoning, and knowledge.
AI agent building — GLM-5 (Reasoning). Terminal-Bench 81 and OSWorld 74 are the best agentic scores from any Chinese model, and competitive with Western frontier models.
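Condensed into code, the recommendations above amount to a simple routing table. This is just the guidance restated; the task keys are hypothetical labels, not anything BenchLM.ai defines.

```python
# The use-case recommendations above as a lookup table. Task keys are hypothetical.
RECOMMENDED = {
    "self_hosted_coding": "Kimi K2.5 (base) or MiMo-V2-Flash",
    "math_science":       "GLM-5 (Reasoning)",
    "long_context":       "Qwen2.5-1M",
    "budget_coding_api":  "MiniMax M2.7",
    "agents":             "GLM-5 (Reasoning)",
}

def pick_model(task: str) -> str:
    # Default to the strongest all-rounder when no category clearly applies.
    return RECOMMENDED.get(task, "Kimi K2.5 (Reasoning)")

print(pick_model("long_context"))  # Qwen2.5-1M
```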
Not all scores are created equal. On the raw leaderboard, MiMo-V2-Pro tops Chinese models at 84 overall — but that's based on only 3 benchmarks (GPQA, SWE-bench, Terminal-Bench). MiMo-V2-Omni scores 76 on just 2 benchmarks. These scores are useful as signals but unreliable as rankings.
By contrast, Kimi K2.5, DeepSeek V3.2, Qwen3.5, and GLM-5 all have 31–32 benchmarks each — giving much higher confidence in their overall scores. When choosing a model, benchmark breadth matters as much as the top-line number. BenchLM.ai's confidence indicator (1–4 dots) reflects how much verified data supports each score.
MiniMax M2.7 sits in between with 11 benchmarks. The coding and agentic scores that exist are strong, but the missing categories (math, knowledge, instruction-following) make it risky to recommend for general-purpose use without testing.
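BenchLM.ai doesn't publish the exact cutoffs behind the dots, but the pattern in this section suggests a mapping like the sketch below. The thresholds are guesses for illustration only.

```python
def confidence_dots(n_benchmarks: int) -> int:
    """Map benchmark coverage to a 1-4 dot confidence indicator.
    Thresholds are illustrative guesses, not BenchLM.ai's published cutoffs."""
    if n_benchmarks >= 25:
        return 4   # e.g. Kimi K2.5, DeepSeek V3.2, Qwen3.5, GLM-5 (31-32 benchmarks)
    if n_benchmarks >= 10:
        return 3   # e.g. MiniMax M2.7 (11 benchmarks)
    if n_benchmarks >= 5:
        return 2
    return 1       # e.g. MiMo-V2-Pro (3), MiMo-V2-Omni (2)
```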
Chinese labs are shipping at an accelerating pace. Key trends to watch:
The open-weight default. Most Chinese frontier models launch with downloadable weights. This is a structural advantage for the ecosystem — it enables fine-tuning, distillation, and self-hosting that closed Western models don't allow.
Specialization over generalization. Kimi optimizes for code. GLM-5 dominates math. MiMo targets efficiency. Rather than one model that does everything, Chinese labs are increasingly building models with clear category strengths.
The Xiaomi factor. A consumer electronics company shipping competitive AI models (MiMo-V2-Flash at score 58) signals that frontier AI capability is diffusing beyond traditional AI labs. MiMo-V2-Pro and MiMo-V2-Omni need more benchmark data, but early scores suggest Xiaomi is serious.
Check the full rankings at /best/chinese-models for the latest scores as new benchmarks are published. For comparisons: Kimi K2.5 vs DeepSeek V3.2 | GLM-5 vs Qwen3.5 | DeepSeek V3.2 vs GPT-5.4.