
State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed

State of LLM benchmarks in 2026: top AI model rankings, category leaders, benchmark trends, open vs closed performance, pricing context, and methodology from BenchLM.

Glevd · March 22, 2026 · 17 min read


Gemini 3.1 Pro is the current #1 model on BenchLM's overall leaderboard with a score of 83. GPT-5.4 is #2 at 80. Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.4 Pro cluster behind them at 76. That is the headline. The more important point is why the ranking looks like this: in 2026, the best models are no longer separated by one generic idea of "intelligence." They are separated by coverage, benchmark mix, and category-specific strengths.

The benchmark landscape also changed. MMLU is effectively saturated. HumanEval is close to solved at the frontier. The benchmarks that still matter are the ones that force spread: HLE for hard knowledge, MMLU-Pro and GPQA for nontrivial factual reasoning, SWE-bench Pro and LiveCodeBench for coding, Terminal-Bench 2.0 and OSWorld-Verified for agents, and MMMU-Pro plus OfficeQA Pro for multimodal work. That shift matters more than any single model launch.

All data below reflects the current BenchLM dataset, last updated March 18, 2026, covering 135 models.

Key findings

  • Gemini 3.1 Pro is the most balanced frontier model right now. It ranks #1 overall at 83 because it is strong across all eight tracked categories, not because it dominates a single benchmark family.
  • GPT-5.4 Pro is the strongest narrow specialist, not the strongest overall model. It leads reasoning (95), knowledge (96.3), instruction following (97), and math (98.3), but it has much narrower published coverage than Gemini 3.1 Pro or GPT-5.4.
  • Coding and agentic benchmarks still create the most useful separation. SWE-bench Pro has a 14-point gap from first to fifth place. Terminal-Bench 2.0 has a 15.7-point gap from first to fifth. MMLU has a 0-point gap across the top five.
  • Open-weight models are now credible top-20 entrants. Qwen2.5-1M ranks #9 overall. DeepSeek V3.2 (Thinking) ranks #16. DeepSeek Coder 2.0 ranks #17. That is not frontier parity, but it is real competitive pressure.
  • No single model owns every category. Gemini 3.1 Pro leads multimodal. GPT-5.4 leads OfficeQA Pro and OSWorld-Verified. Claude Opus 4.6 leads HLE and ties for multilingual leadership. GPT-5.4 Pro leads reasoning and math. The frontier is fragmented.
  • Coverage quality is now part of the ranking story. Models with exceptional but sparse benchmark rows can look unbeatable in narrow views and still lose the overall table because BenchLM rewards cross-category evidence.
  • Price-performance matters more than raw peak performance for most teams. Gemini 3.1 Pro at $1.25 / $5 and GPT-5.4 at $2.50 / $15 are far easier to justify for broad production use than GPT-5.4 Pro at $30 / $180 or Claude Opus 4.6 at $15 / $75.

The overall leaderboard

BenchLM's overall score weights eight categories: agentic (22%), coding (20%), reasoning (17%), multimodal grounded (12%), knowledge (12%), multilingual (7%), instruction following (5%), and math (5%). That weighting matters. A model with one elite skill and missing category coverage can lose to a model that is merely excellent everywhere.

Top 20 models overall

| Rank | Model | Creator | Overall | Strongest category | Notes |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 83 | Multimodal / Math | Broadest elite all-rounder with full cross-category coverage |
| 2 | GPT-5.4 | OpenAI | 80 | Instruction following / Reasoning | Broad coverage, strongest all-purpose OpenAI general model |
| 3 | Claude Opus 4.6 | Anthropic | 76 | Math / Multilingual | Excellent depth, especially HLE and multilingual |
| 4 | Claude Sonnet 4.6 | Anthropic | 76 | Math / Multimodal | Broader coverage than Opus, weaker coding profile |
| 5 | GPT-5.4 Pro | OpenAI | 76 | Math / Knowledge / Reasoning | Best specialist profile, but narrower published coverage |
| 6 | Gemini 3 Pro Deep Think | Google | 70 | Math / Agentic | Strong on BrowseComp, but not a complete all-category row |
| 7 | GPT-5.3 Codex | OpenAI | 70 | Math / Reasoning / Coding | One of the strongest coding-focused models on BenchLM |
| 8 | o3-mini | OpenAI | 70 | Instruction following / Reasoning | Strong reasoning package with weaker coding depth |
| 9 | Qwen2.5-1M | Alibaba | 67 | Math / Reasoning | Best open-weight overall model on BenchLM right now |
| 10 | Grok 4 | xAI | 67 | Math / Reasoning | Strong coding and multimodal row, weaker agentic profile |
| 11 | GPT-5.2 | OpenAI | 67 | Math / Knowledge | Strong older OpenAI row, weaker agentic coverage |
| 12 | Kimi K2.5 (Reasoning) | Moonshot AI | 67 | Multilingual / Math | Excellent coding and multilingual signals, sparser breadth |
| 13 | GPT-4.1 | OpenAI | 67 | Instruction following | Broadly competent, no longer frontier-leading in coding |
| 14 | o1 | OpenAI | 67 | Instruction following | Reasoning-first model with weak coding relative to 2026 leaders |
| 15 | Grok 4.1 | xAI | 67 | Math / Knowledge / Reasoning | Very strong partial row, but far narrower benchmark footprint |
| 16 | DeepSeek V3.2 (Thinking) | DeepSeek | 66 | Math / Multilingual | Best open-weight reasoning-agentic compromise from DeepSeek |
| 17 | DeepSeek Coder 2.0 | DeepSeek | 66 | Instruction following / Math | Strong open coding option, still well behind top closed models overall |
| 18 | o3 | OpenAI | 63 | Math | Competitive reasoning line, weak coding row |
| 19 | Nemotron 3 Ultra 500B | NVIDIA | 63 | Instruction following | Open-weight scale play with weak coding depth |
| 20 | Qwen3.5 397B (Reasoning) | Alibaba | 63 | Math / Reasoning | Strong narrow reasoning profile, less cross-category evidence |

Why Gemini 3.1 Pro is #1

Gemini 3.1 Pro does not win because it has the single highest ceiling anywhere. It wins because it almost never drops.

Its category profile is unusually balanced:

  • Agentic: 76.1
  • Coding: 72.0
  • Reasoning: 88.3
  • Multimodal grounded: 95.0
  • Knowledge: 80.7
  • Multilingual: 94.1
  • Instruction following: 95.0
  • Math: 97.1

That is the kind of row that survives any reasonable weighting system. It also has benchmark coverage across all eight categories, which matters when overall scores are meant to represent real deployment breadth rather than a lab highlight reel.
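For readers who want to check the arithmetic, the sketch below reproduces an overall score of roughly 83 from the category row above, assuming BenchLM applies its published category weights as a straight weighted average (the exact aggregation and rounding may differ).

```python
# Minimal sketch: reproduce an overall score from a published category row.
# Assumes BenchLM's overall score is a straight weighted average of its eight
# category scores; the actual aggregation and rounding may differ.

WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.17,
    "multimodal_grounded": 0.12,
    "knowledge": 0.12,
    "multilingual": 0.07,
    "instruction_following": 0.05,
    "math": 0.05,
}

gemini_3_1_pro = {
    "agentic": 76.1,
    "coding": 72.0,
    "reasoning": 88.3,
    "multimodal_grounded": 95.0,
    "knowledge": 80.7,
    "multilingual": 94.1,
    "instruction_following": 95.0,
    "math": 97.1,
}

overall = sum(weight * gemini_3_1_pro[category] for category, weight in WEIGHTS.items())
print(round(overall, 1))  # 83.4, consistent with the published overall score of 83
```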

The strongest part of the Gemini case is multimodal plus long-context utility. It leads the weighted multimodal category at 95, ties the top MMMU-Pro score at 95, and remains competitive on OfficeQA Pro at 95. It also holds up unusually well in multilingual and math. If you want a single general-purpose model with few obvious holes, Gemini 3.1 Pro has the cleanest argument on current data.

Why GPT-5.4 is still the other real contender

GPT-5.4 ranks #2 overall at 80 and is probably the strongest "safe default" alternative to Gemini 3.1 Pro.

Its broad row is difficult to dismiss:

  • Agentic: 77.0
  • Coding: 72.8
  • Reasoning: 89.9
  • Multimodal grounded: 87.9
  • Knowledge: 83.1
  • Multilingual: 94.0
  • Instruction following: 96.0

It also leads some of the benchmarks that matter most in actual product workflows. GPT-5.4 is #1 on OfficeQA Pro at 96 and #1 on OSWorld-Verified at 75. That matters because those are not toy tasks. They reward models that can operate over messy interfaces, documents, screenshots, and workflow-like inputs.

The only obvious hole is math coverage. GPT-5.4 has no current math row in BenchLM's dataset, which keeps it from challenging GPT-5.4 Pro on pure specialist strength. But for teams that care more about breadth than benchmark perfection in a narrow category, GPT-5.4 remains a very strong all-around production option.

The specialist tier: GPT-5.4 Pro, GPT-5.3 Codex, and Claude

GPT-5.4 Pro is the strongest model on several individual category tables:

  • Reasoning: 95.0
  • Knowledge: 96.3
  • Instruction following: 97.0
  • Math: 98.3
  • SWE-bench Verified: 86

If you only looked at these rows, you would assume GPT-5.4 Pro is the clear overall winner. It is not. The reason is simple: the current published row is narrower. GPT-5.4 Pro has no current agentic, multimodal, or multilingual category eligibility in BenchLM's data. BenchLM's overall leaderboard rewards demonstrated breadth. That is the correct choice for a production ranking, even if it is less flattering to narrow specialist leaders.

GPT-5.3 Codex is the same story in a more coding-centric form. It ranks #7 overall, but it remains one of the strongest coding models in the dataset:

  • Coding category: 85
  • SWE-bench Verified: 85
  • Reasoning category: 93
  • Math category: 97.6

For teams choosing an API for coding agents or software tooling, GPT-5.3 Codex is arguably more interesting than its overall rank suggests.

Anthropic's frontier is split between two models:

  • Claude Opus 4.6 brings the stronger deep-knowledge and multilingual profile. It leads HLE at 53 and ties for the best multilingual category score at 96.
  • Claude Sonnet 4.6 is the broader all-purpose Anthropic row, especially strong in multimodal with a weighted score of 91.9 and a tied #1 MMMU-Pro score of 95.

The important takeaway is that the frontier is no longer winner-take-all. Gemini leads the most balanced overall package. GPT has the strongest specialist lines in several categories. Anthropic still owns some of the hardest knowledge and multilingual rows. Model selection is now more about use case fit than abstract prestige.

Category leaders

The overall leaderboard is useful, but it hides where the real separation lives. These are the category tables that matter most.

Coding

BenchLM's coding category gives the most weight to the benchmarks that still separate frontier models, such as SWE-bench Pro and LiveCodeBench, rather than to saturated legacy tests.

That weighting is important because it reflects what changed in 2026. HumanEval still exists, but it is no longer the main signal for frontier coding models.

Top coding category scores:

| Rank | Model | Coding score |
|---|---|---|
| 1 | GPT-5.4 Pro | 86.0 |
| 2 | GPT-5.3 Codex | 85.0 |
| 3 | Kimi K2.5 (Reasoning) | 82.9 |
| 4 | Kimi K2.5 | 82.9 |
| 5 | Claude Opus 4.5 | 80.9 |

The benchmark-level picture is more fragmented:

  • SWE-bench Verified leader: GPT-5.4 Pro (86)
  • SWE-bench Pro leader: Gemini 3.1 Pro (72)
  • LiveCodeBench leader: Kimi K2.5 (Reasoning) and Kimi K2.5 (85)

That is a good example of why single-benchmark arguments are weak. If you care about repository bug fixing, GPT-5.4 Pro and GPT-5.3 Codex look best. If you care about fresh coding tasks, Kimi matters more. If you want the broadest balanced general-purpose model that still holds up on coding, Gemini 3.1 Pro remains competitive.

Agentic

Agentic is now the heaviest category in BenchLM's formula at 22%, ahead of coding. That is a reasonable reflection of where the market is going.

Top agentic category scores:

| Rank | Model | Agentic score |
|---|---|---|
| 1 | MiMo-V2-Pro | 86.7 |
| 2 | Gemini 3 Pro Deep Think | 78.8 |
| 3 | GPT-5.4 | 77.0 |
| 4 | Gemini 3.1 Pro | 76.1 |
| 5 | Claude Opus 4.6 | 72.6 |

Two caveats matter here:

  1. MiMo-V2-Pro is not a clean overall winner story. It has only three benchmark results in BenchLM's current dataset, so its #1 agentic placement should be read as a narrow signal, not a full model judgment.
  2. The benchmark leaders split by task type. MiMo-V2-Pro leads Terminal-Bench 2.0 at 86.7. Gemini 3 Pro Deep Think leads BrowseComp at 87. GPT-5.4 leads OSWorld-Verified at 75.

The bigger story is that agentic benchmarks still have real spread. Terminal-Bench 2.0 has a 15.7-point gap between first and fifth. BrowseComp has a 10-point gap. These are not saturated tests.

Reasoning and knowledge

Reasoning and knowledge are where the old and new benchmark worlds collide.

Top reasoning category scores:

| Rank | Model | Reasoning score |
|---|---|---|
| 1 | GPT-5.4 Pro | 95.0 |
| 2 | GPT-5.3 Codex | 93.0 |
| 3 | Grok 4.1 | 93.0 |
| 4 | GPT-5.4 | 89.9 |
| 5 | Gemini 3.1 Pro | 88.3 |

Top knowledge category scores:

| Rank | Model | Knowledge score |
|---|---|---|
| 1 | GPT-5.4 Pro | 96.3 |
| 2 | Grok 4.1 | 95.6 |
| 3 | Claude Opus 4.5 | 95.0 |
| 4 | GPT-5.2-Codex | 95.0 |
| 5 | Gemini 3 Pro | 95.0 |

At first glance those knowledge scores look tightly clustered, and that is exactly the point. Some older knowledge benchmarks are no longer doing useful work. On MMLU, the top five models all score 99. That benchmark is now a floor check, not a frontier separator.

The better 2026 signals are:

  • MMLU-Pro: top 94, fifth 90
  • GPQA: top 99, fifth 92.8
  • HLE: top 53, fifth 40

HLE is the clearest knowledge separator right now. Claude Opus 4.6 leads it at 53, ahead of Claude Sonnet 4.6 at 49 and GPT-5.4 at 48. If you want one benchmark that still tells you something meaningful about the hard edge of frontier knowledge, HLE has the strongest argument.

Multimodal grounded

This is one of the biggest strategic categories in 2026 because enterprise and agent workflows increasingly involve screenshots, spreadsheets, documents, dashboards, and mixed inputs rather than plain text.

Top multimodal category scores:

| Rank | Model | Multimodal score |
|---|---|---|
| 1 | Gemini 3.1 Pro | 95.0 |
| 2 | Claude Sonnet 4.6 | 91.9 |
| 3 | GPT-5.4 | 87.9 |
| 4 | Gemini 3 Pro | 81.0 |
| 5 | Claude 4 Sonnet | 79.7 |

The benchmark split is useful:

  • MMMU-Pro leaders: Gemini 3.1 Pro (95) and Claude Sonnet 4.6 (95)
  • OfficeQA Pro leader: GPT-5.4 (96)

That means the multimodal story is not just "Gemini wins." Gemini wins the weighted category because it is elite on both tests. But GPT-5.4 deserves explicit credit for being best on the more office-workflow-shaped benchmark.

Open-weight vs closed: the gap is smaller, but still real

Open-weight models are now good enough that the conversation changed. They are no longer interesting only as cheap substitutes. Some are credible choices.

The strongest open-weight overall rows on BenchLM right now are:

| Rank | Model | Overall | Notes |
|---|---|---|---|
| 1 | Qwen2.5-1M | 67 | Best open-weight overall model on BenchLM |
| 2 | DeepSeek V3.2 (Thinking) | 66 | Strong reasoning-agentic compromise |
| 3 | DeepSeek Coder 2.0 | 66 | Best open coding profile in the top open-weight tier |
| 4 | Nemotron 3 Ultra 500B | 63 | Strong scale and instruction-following profile |
| 5 | Qwen3.5 397B (Reasoning) | 63 | Narrow reasoning strength, thinner breadth |

This is real progress. Qwen2.5-1M ranking #9 overall means the open-weight tier is no longer shut out of the frontier conversation.

But the gap is still visible in three places:

  • Coding depth: Qwen2.5-1M's coding score is 44.9. DeepSeek Coder 2.0 improves that to 52.7, but both trail the 80+ closed-model leaders.
  • Multimodal quality: the top closed models are still materially ahead on MMMU-Pro and OfficeQA Pro.
  • Broad category consistency: the best closed models stack high scores across more categories at once.

The practical version is simple: if you need the absolute best all-purpose model, you still end up in the proprietary tier. If you need a strong open-weight system, the open tier is now good enough that the tradeoff is technical, not symbolic.

The benchmarks that matter now

The clearest shift in 2026 is not just which model is first. It is which benchmarks are still worth trusting as primary frontier signals.

Benchmark spread snapshot

| Benchmark | Top score | 5th place | 10th place | What it tells you |
|---|---|---|---|---|
| MMLU | 99 | 99 | 90.8 | Effectively saturated at the top |
| MMLU-Pro | 94 | 90 | 84 | Still useful for broad knowledge separation |
| HLE | 53 | 40 | 18.8 | Strongest hard-knowledge spread |
| SWE-bench Pro | 72 | 58 | 55 | Real coding separation remains |
| LiveCodeBench | 85 | 80 | 63.6 | Fresh coding tasks still create a long tail |
| Terminal-Bench 2.0 | 86.7 | 71 | 63 | One of the clearest agentic separators |
| BrowseComp | 87 | 77 | 72 | Strong research-agent benchmark |
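One way to make "spread" concrete is to compute the first-to-fifth and first-to-tenth gaps for each benchmark and flag the ones where the top of the table has collapsed. Here is a minimal sketch using the numbers from the snapshot above; the 4-point saturation threshold is an illustrative choice, not a BenchLM definition.

```python
# Flag benchmarks whose top-of-table spread has collapsed, using the
# (top, 5th place, 10th place) scores from the spread snapshot above.
# The 4-point saturation threshold is illustrative, not a BenchLM definition.

spreads = {
    "MMLU":               (99.0, 99.0, 90.8),
    "MMLU-Pro":           (94.0, 90.0, 84.0),
    "HLE":                (53.0, 40.0, 18.8),
    "SWE-bench Pro":      (72.0, 58.0, 55.0),
    "LiveCodeBench":      (85.0, 80.0, 63.6),
    "Terminal-Bench 2.0": (86.7, 71.0, 63.0),
    "BrowseComp":         (87.0, 77.0, 72.0),
}

SATURATION_THRESHOLD = 4.0  # first-to-fifth gaps below this count as saturated

for name, (top, fifth, tenth) in spreads.items():
    gap_to_fifth = top - fifth
    gap_to_tenth = top - tenth
    status = "saturated at the top" if gap_to_fifth < SATURATION_THRESHOLD else "still separating"
    print(f"{name}: 1st-5th gap {gap_to_fifth:.1f}, 1st-10th gap {gap_to_tenth:.1f} -> {status}")
```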

Weakening signals

MMLU has effectively stopped differentiating elite knowledge models: the top five scores are 99, 99, 99, 99, 99.

HumanEval still provides some signal, but far less than it used to: the top five scores are 99, 95, 95, 95, 93. That is better than MMLU, but still weak compared with harder coding tests.

Stronger frontier signals

HLE remains one of the best high-end knowledge filters because the spread is still large: top score 53, fifth score 40, tenth score 18.8.

SWE-bench Pro remains more useful than legacy coding tests because the spread is still large: top score 72, fifth 58, tenth 55.

LiveCodeBench is valuable because it uses fresher tasks and still shows a meaningful long tail: top score 85, tenth 63.6.

Terminal-Bench 2.0 is one of the clearest non-saturated agentic benchmarks in the set: top score 86.7, fifth 71, tenth 63.

BrowseComp is a strong research-agent benchmark with a meaningful spread: top 87, fifth 77.

OSWorld-Verified is slightly tighter at the top, but still useful because it measures real interface work rather than stylized QA.

The broader point is that benchmark selection has become a methodological problem. If your ranking still leans heavily on MMLU and HumanEval, it will systematically overstate certainty and understate real product differences.

Price versus performance

The frontier no longer has one obvious value story. It has tiers.

Broad production tier

  • Gemini 3.1 Pro: $1.25 / $5
  • GPT-5.4: $2.50 / $15
  • Claude Sonnet 4.6: $3 / $15

These are the models that make the most sense for teams that need broad capability without paying flagship premiums.

Specialist premium tier

  • Claude Opus 4.6: $15 / $75
  • GPT-5.4 Pro: $30 / $180

These prices are hard to justify unless you specifically need the incremental specialist gains. GPT-5.4 Pro does earn its place on reasoning, knowledge, instruction following, and math. Claude Opus 4.6 does earn its place on HLE and multilingual. But most teams are not buying a benchmark trophy. They are buying a model that has to survive production economics.

Coding-specific value case

  • GPT-5.3 Codex: $2.50 / $10

This remains one of the cleanest price-performance stories in the whole market. It is a top-tier coding model by BenchLM's weighted coding view, while staying far cheaper than the flagship specialist tier.
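To make the tiers concrete, it helps to cost out a single workload against the list prices above. Below is a minimal sketch, assuming the listed figures are USD per million input and output tokens and using an illustrative workload of 2,000 input and 600 output tokens per request at one million requests per month; the workload numbers are assumptions, not BenchLM data.

```python
# Estimate per-request and monthly spend for a hypothetical workload across the
# pricing tiers above. Assumes the listed prices are USD per 1M input / output
# tokens; the token counts and request volume are illustrative assumptions.

PRICES_PER_1M_TOKENS = {  # model: (input price, output price)
    "Gemini 3.1 Pro":    (1.25, 5.00),
    "GPT-5.4":           (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.3 Codex":     (2.50, 10.00),
    "Claude Opus 4.6":   (15.00, 75.00),
    "GPT-5.4 Pro":       (30.00, 180.00),
}

INPUT_TOKENS_PER_REQUEST = 2_000   # assumed
OUTPUT_TOKENS_PER_REQUEST = 600    # assumed
REQUESTS_PER_MONTH = 1_000_000     # assumed

for model, (input_price, output_price) in PRICES_PER_1M_TOKENS.items():
    per_request = (INPUT_TOKENS_PER_REQUEST * input_price
                   + OUTPUT_TOKENS_PER_REQUEST * output_price) / 1_000_000
    monthly = per_request * REQUESTS_PER_MONTH
    print(f"{model}: ${per_request:.4f} per request, ~${monthly:,.0f} per month")
```

Under those assumptions, the same traffic costs roughly $5,500 a month on Gemini 3.1 Pro and roughly $168,000 a month on GPT-5.4 Pro, which is the economic argument behind the tier split above.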

What changed from 2025 to 2026

The biggest change is not that one lab pulled ahead forever. It is that the benchmark stack got harsher.

2025 framing vs 2026 framing

| Area | 2025 emphasis | 2026 emphasis | Why it matters |
|---|---|---|---|
| Knowledge ranking | MMLU-heavy discussions | MMLU-Pro, GPQA, HLE | Harder tests still create spread |
| Coding quality | HumanEval and legacy SWE-bench citations | SWE-bench Pro and LiveCodeBench | Better match for real software work |
| Agent capability | Demo-driven claims | Terminal-Bench 2.0, BrowseComp, OSWorld-Verified | Action-based evaluation is harder to fake |
| Multimodal strength | Vision as a side feature | MMMU-Pro and OfficeQA Pro | Grounded document and UI tasks now matter |
| Leaderboard logic | Best single score wins headlines | Breadth and coverage matter more | Sparse rows should not decide the full table |

Compared with the older evaluation mix that centered more attention on MMLU, HumanEval, and broad chat impressions, 2026 is much more defined by harder knowledge tests (MMLU-Pro, GPQA, HLE), repository-scale and fresh coding benchmarks, action-based agentic evaluation, grounded multimodal document and UI tasks, and demonstrated breadth of coverage.

That shift makes the leaderboard harder to game and harder to summarize lazily. It also explains why some models look better in current serious ranking systems than they did in older, more benchmark-saturated discussions.

The second change is that open-weight systems are now close enough to matter strategically, even when they are not yet winning overall. The top open models have entered the main table. They are no longer living in a separate hobbyist category.

The third change is that coverage matters more than ever. In a world where model labs selectively publish benchmark rows, the best ranking system is not the one that rewards the loudest claim. It is the one that discounts incomplete evidence.

Methodology

This report uses BenchLM's current public ranking system and benchmark dataset.

Data scope

  • Dataset last updated: March 18, 2026
  • Total tracked models: 135
  • Overall score uses eight categories
  • Models need sufficient published benchmark evidence to rank meaningfully across the site

Overall category weights

  • Agentic: 22%
  • Coding: 20%
  • Reasoning: 17%
  • Multimodal grounded: 12%
  • Knowledge: 12%
  • Multilingual: 7%
  • Instruction following: 5%
  • Math: 5%

Examples of within-category weighting

Within each category, the benchmarks that still create real separation carry more influence than saturated legacy tests: in coding, for example, SWE-bench Pro and LiveCodeBench matter more than HumanEval, and in knowledge, HLE and MMLU-Pro matter more than MMLU.

Why sparse coverage matters

Some models have outstanding narrow rows with limited total coverage. GPT-5.4 Pro is the best example among current frontier proprietary models. MiMo-V2-Pro is the clearest example in agentic. These are real signals, but they are not enough by themselves to justify an undisputed overall #1 position. BenchLM's leaderboard is intentionally conservative here.

That is the right editorial stance for a source-of-record product. If the data is partial, the claim should also be partial.
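BenchLM does not spell out here exactly how missing category rows feed into the overall number, but the tension is easy to illustrate. The sketch below contrasts two common aggregation choices, renormalizing over covered categories versus scoring missing categories with a low prior; the sparse row uses GPT-5.4 Pro's published narrow category scores purely as an example, and neither output is BenchLM's actual figure for any model.

```python
# Contrast two common ways to aggregate a model with missing category rows.
# Both strategies are illustrative; BenchLM's actual treatment is not specified
# here, and neither output below is a real BenchLM score.

WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "multimodal_grounded": 0.12,
    "knowledge": 0.12, "multilingual": 0.07, "instruction_following": 0.05, "math": 0.05,
}

def renormalized(scores):
    """Average over covered categories only, reweighting to sum to 1 (flatters sparse rows)."""
    total_weight = sum(WEIGHTS[c] for c in scores)
    return sum(WEIGHTS[c] * s for c, s in scores.items()) / total_weight

def conservative(scores, prior=0.0):
    """Score missing categories at a low prior (rewards demonstrated breadth)."""
    return sum(WEIGHTS[c] * scores.get(c, prior) for c in WEIGHTS)

# Sparse specialist row: elite where it reports, absent elsewhere
# (values echo GPT-5.4 Pro's published narrow scores, used only as an example).
sparse_row = {"reasoning": 95.0, "knowledge": 96.3, "instruction_following": 97.0, "math": 98.3}

print(f"{renormalized(sparse_row):.1f}")   # ~96.1: looks unbeatable in the narrow view
print(f"{conservative(sparse_row):.1f}")   # ~37.5: collapses once missing coverage counts
```

GPT-5.4 Pro's actual published overall is 76, between those two extremes, which is consistent with a system that credits narrow excellence while still discounting incomplete evidence.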

Bottom line

If you want the most balanced general model in BenchLM's current data, pick Gemini 3.1 Pro.

If you want the strongest all-purpose OpenAI model with broad product usefulness, pick GPT-5.4.

If you want the strongest narrow specialist row on reasoning, knowledge, and math, look at GPT-5.4 Pro, but read the coverage caveat.

If you want the strongest coding-focused value play, look at GPT-5.3 Codex.

If you want the best open-weight all-rounder, start with Qwen2.5-1M.

The larger conclusion is that "best model" is no longer a serious question without a benchmark context. In 2026, the meaningful question is: best model for which task, under which cost constraints, with how much published evidence?

See the full leaderboard · Coding rankings · Reasoning rankings · Knowledge rankings · Agentic rankings · Pricing · What benchmarks actually measure · Are AI benchmarks reliable?


Frequently asked questions

What is the best AI model in 2026?
On BenchLM's current data, Gemini 3.1 Pro is #1 overall with a score of 83. It wins because it is strong across every major category rather than dominating only one benchmark family.

Which LLM is best for coding in 2026?
The weighted coding category is led by GPT-5.4 Pro at 86, followed by GPT-5.3 Codex at 85. But the benchmark-level answer depends on what you care about: GPT-5.4 Pro leads SWE-bench Verified, Gemini 3.1 Pro leads SWE-bench Pro, and Kimi K2.5 leads LiveCodeBench.

Are open-weight models close to proprietary models in 2026?
Closer, yes. Equal, no. Qwen2.5-1M ranks #9 overall and DeepSeek V3.2 (Thinking) ranks #16, which means the open-weight tier is now competitive enough to matter in serious model selection. The remaining gap is largest in coding depth, multimodal quality, and broad all-category consistency.

Which benchmarks matter most in 2026?
The strongest current separators are HLE, MMLU-Pro, GPQA, SWE-bench Pro, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro. MMLU and HumanEval still provide context, but they are weaker primary ranking signals now.

Are older benchmarks like MMLU still useful?
Only as a baseline. MMLU no longer separates top frontier models in a meaningful way because the best systems all cluster at 99.

How does BenchLM rank models?
BenchLM uses weighted category scores rather than a flat average. Agentic and coding carry the most weight, and within each category the benchmarks that still produce real separation carry more influence than saturated legacy tests.


All data sourced from BenchLM.ai. Dataset last updated March 18, 2026.
