
State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed

State of LLM benchmarks in 2026: current BenchLM rankings, category leaders, benchmark trends, open vs closed performance, and what still matters after the latest scoring changes.

Glevd · Published March 22, 2026 · Updated April 8, 2026 · 17 min read


The benchmark picture in April 2026 is different from the one many people still have in their heads. The old story was simple: one or two headline models sat clearly above the field, older knowledge benchmarks still mattered too much, and open-weight rows were interesting but not yet close. The current data is messier and more useful.

The top of the leaderboard is now fragmented. Claude Mythos Preview sits at 99 overall, but the broader mainstream frontier cluster is tighter: Gemini 3.1 Pro and GPT-5.4 are tied at 94, with Claude Opus 4.6 and GPT-5.4 Pro at 92. Open-weight models have moved up too. GLM-5 (Reasoning) is at 85, GLM-5.1 at 84, and Qwen3.5 397B (Reasoning) at 81.

All data below reflects BenchLM's live dataset, last updated April 8, 2026.

Key findings

  • The very top is no longer a single-model story. Claude Mythos Preview leads at 99, but the broader mainstream frontier is a 94/94/92/92 cluster.
  • Coding is still one of the best separators. Claude Mythos Preview leads outright, with Gemini 3.1 Pro, GPT-5.4 Pro, Claude Opus 4.6, and GPT-5.4 tightly grouped behind it, so real spread still exists.
  • Agentic benchmarks still matter. GPT-5.4 remains one of the clearest broad-purpose leaders on agentic work, while narrow specialist rows can still spike higher.
  • Open-weight rows are now real top-tier entrants. GLM-5 (Reasoning), GLM-5.1, and Qwen3.5 397B (Reasoning) are not novelty rows anymore.
  • Benchmark choice matters more than ever. The older saturated tests are still useful for context, but the frontier is now decided by harder benchmarks with meaningful spread.

The overall leaderboard

Top 10 models overall

Rank  Model                    Creator    Overall  Notes
1     Claude Mythos Preview    Anthropic  99       Current overall leader
2     Gemini 3.1 Pro           Google     94       Best value mainstream flagship
3     GPT-5.4                  OpenAI     94       Strongest broad OpenAI default
4     Claude Opus 4.6          Anthropic  92       Best writing-first flagship
5     GPT-5.4 Pro              OpenAI     92       Strongest specialist reasoning/math row
6     GPT-5.3 Codex            OpenAI     89       Elite coding-oriented row
7     Gemini 3 Pro Deep Think  Google     87       Strong multimodal and reasoning profile
8     Claude Sonnet 4.6        Anthropic  86       Broad, cheaper Anthropic flagship lane
9     GLM-5 (Reasoning)        Z.AI       85       Best open-weight overall row
10    GLM-5.1                  Z.AI       84       Strong follow-on open-weight row

The main thing to notice is that the current leaderboard no longer maps cleanly onto a single vendor narrative. Anthropic has the #1 row, Google and OpenAI are tied just behind, and Z.AI now has the strongest open-weight entries.

Category leaders that matter

Coding

Rank  Model                  Coding score
1     Claude Mythos Preview  100
2     Gemini 3.1 Pro         94.3
3     GPT-5.4 Pro            92.8
4     Claude Opus 4.6        90.8
5     GPT-5.4                90.7

Coding is still one of the cleanest frontier separators because the main coding benchmarks have not fully collapsed into saturation. The field is very tight at the top, but it is still meaningfully rankable.

Agentic

Rank  Model                  Agentic score
1     Claude Mythos Preview  100
2     GPT-5.4                93.5
3     Claude Opus 4.6        92.6
4     GPT-5.4 Pro            92.4
5     Gemini 3.1 Pro         87.8

Agentic remains one of the best reasons not to overfocus on old academic-style benchmarks. The models that win here are the ones that matter most for actual tool use, software interaction, and multi-step workflows.

Reasoning

Rank  Model           Reasoning score
1     GPT-5.4 Pro     99.3
2     Gemini 3.1 Pro  97
3     GPT-5.3 Codex   94.7
4     GPT-5.4         93
5     Grok 4.1        91.9

Knowledge

Rank  Model                  Knowledge score
1     Muse Spark             100
2     Claude Mythos Preview  98.7
3     GPT-5.4                97.6
4     Gemini 3.1 Pro         95.6
5     Grok 4.1               94.7

Knowledge is where the benchmark-selection issue matters most. Some rows still look strong because of older tests that are partially saturated. The harder separators now are HLE, GPQA, and MMLU-Pro rather than the older baseline-style knowledge benchmarks.

Multimodal grounded

Rank  Model                    Multimodal score
1     GPT-5.4 Pro              100
2     Gemini 3 Pro Deep Think  100
3     Claude Mythos Preview    97.8
4     Grok 4.1                 97.5
5     GPT-5.1                  95.8

This is one of the most commercially relevant categories in 2026 because real workloads increasingly involve screenshots, documents, charts, and mixed-media contexts rather than only plain text.

Which benchmarks still matter

The benchmarks that still do useful work are the ones with real frontier spread:

  • HLE for hard knowledge
  • GPQA and MMLU-Pro for knowledge depth
  • SWE-bench Pro, SWE-bench Verified, and LiveCodeBench for coding
  • Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified for agentic work
  • MMMU-Pro and OfficeQA Pro for multimodal grounded tasks

The benchmarks that matter less than they used to are the saturated or legacy rows that mostly act as floor checks rather than frontier separators.
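
One rough way to make "real frontier spread" concrete is to compare max minus min across the top rows of each category. The sketch below does this with the top-5 scores from the category tables above; max-minus-min is a crude illustrative proxy, not BenchLM's actual methodology.

```python
# Back-of-the-envelope "frontier spread" per category, using the
# top-5 scores from the category tables above. A wide spread means
# the category still separates frontier models; a narrow one is
# drifting toward saturation.
top5_scores = {
    "coding":              [100, 94.3, 92.8, 90.8, 90.7],
    "agentic":             [100, 93.5, 92.6, 92.4, 87.8],
    "reasoning":           [99.3, 97, 94.7, 93, 91.9],
    "knowledge":           [100, 98.7, 97.6, 95.6, 94.7],
    "multimodal grounded": [100, 100, 97.8, 97.5, 95.8],
}

# Sort categories from widest to narrowest spread.
for category, scores in sorted(top5_scores.items(),
                               key=lambda kv: max(kv[1]) - min(kv[1]),
                               reverse=True):
    print(f"{category:<20} spread: {max(scores) - min(scores):.1f}")
```

On these numbers, agentic (12.2) and coding (9.3) still separate the top-5 the most, while knowledge (5.3) and multimodal grounded (4.2) are already tight, which is consistent with the saturation argument above.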

Open-weight versus proprietary

The open-weight story is now much stronger than it was a year ago.

Model                     Type         Overall
Gemini 3.1 Pro            Proprietary  94
GPT-5.4                   Proprietary  94
Claude Opus 4.6           Proprietary  92
GLM-5 (Reasoning)         Open Weight  85
GLM-5.1                   Open Weight  84
Qwen3.5 397B (Reasoning)  Open Weight  81

The top open-weight row still trails the mainstream proprietary frontier by 9 points (85 versus 94). That is not parity, but it is no longer an afterthought either. The gap is smaller than it was a year ago, and in some narrow categories open-weight rows are fully competitive.
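
As a quick sanity check on that 9-point figure, here is a minimal sketch that picks the best row of each type from the table above. The scores are the published overall numbers, hardcoded for illustration.

```python
# Best proprietary vs. best open-weight row, from the table above.
rows = [
    ("Gemini 3.1 Pro",           "proprietary", 94),
    ("GPT-5.4",                  "proprietary", 94),
    ("Claude Opus 4.6",          "proprietary", 92),
    ("GLM-5 (Reasoning)",        "open-weight", 85),
    ("GLM-5.1",                  "open-weight", 84),
    ("Qwen3.5 397B (Reasoning)", "open-weight", 81),
]

# Keep the highest-scoring row of each type.
best = {}
for name, kind, score in rows:
    if kind not in best or score > best[kind][1]:
        best[kind] = (name, score)

gap = best["proprietary"][1] - best["open-weight"][1]
print(best["proprietary"], best["open-weight"])  # ('Gemini 3.1 Pro', 94) ('GLM-5 (Reasoning)', 85)
print(f"gap: {gap} points")                      # gap: 9 points
```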

What changed the most

  • The top flagship narrative changed. It is no longer obvious that one lab owns the whole board.
  • Open-weight models moved up. GLM-5 (Reasoning) and GLM-5.1 are now firmly part of the serious comparison set.
  • Coding and agentic work stayed useful. These categories still produce some of the best real-world separation.
  • The old benchmark mix matters less. The frontier is increasingly decided by harder tasks rather than legacy rows that mostly tell you a model is not broken.

Bottom line

The 2026 benchmark landscape is broader, tighter, and harder to summarize with one headline winner.

  • If you want the current #1 row, it is Claude Mythos Preview.
  • If you want the strongest mainstream value flagship, it is Gemini 3.1 Pro.
  • If you want the strongest broad OpenAI default, it is GPT-5.4.
  • If you want the strongest open-weight overall row, it is GLM-5 (Reasoning).

The big change is not just who ranks first. It is that the leaderboard now has multiple credible top-tier stories depending on whether you care about value, specialist depth, interaction quality, or open-weight access.


Frequently asked questions

What is the best AI model in 2026? On BenchLM's current data, Claude Mythos Preview leads overall at 99. Among the broader mainstream frontier rows, Gemini 3.1 Pro and GPT-5.4 are tied at 94.

Which LLM is best for coding in 2026? Claude Mythos Preview leads the current coding category. Among mainstream rows, Gemini 3.1 Pro, GPT-5.4 Pro, Claude Opus 4.6, and GPT-5.4 are all clustered tightly near the top.

Are open-weight models close to proprietary models now? Closer, yes; equal, no. The best open-weight row scores 85, versus 94 for the mainstream proprietary frontier and 99 for the overall leader.

Which benchmarks matter most in 2026? HLE, GPQA, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro.

Are older benchmarks like MMLU still useful? Mostly as baseline context. The more meaningful frontier separation now comes from the harder benchmark set.


All benchmark data is from BenchLM's live dataset, current as of April 8, 2026.
