State of LLM benchmarks in 2026: current BenchLM rankings, category leaders, benchmark trends, open vs closed performance, and what still matters after the latest scoring changes.
The benchmark picture in April 2026 is different from the one many people still have in their heads. The old story was simple: one or two headline models sat clearly above the field, older knowledge benchmarks still mattered too much, and open-weight rows were interesting but not yet close. The current data is messier and more useful.
The top of the leaderboard is now fragmented. Claude Mythos Preview sits at 99 overall, but the broader mainstream frontier cluster is tighter: Gemini 3.1 Pro at 93, GPT-5.4 Pro at 92, Grok 4.1 at 90, and GPT-5.5 at 89. GPT-5.5 now sits above the superseded GPT-5.4 row, now that stale external calibration has been stripped from superseded models. Open-weight models have moved up too: DeepSeek V4 Pro (Max) is at 87, Kimi K2.6 at 84, GLM-5 (Reasoning) and GLM-5.1 both at 83, and Qwen3.5 397B (Reasoning) at 79.
All data below reflects BenchLM's live dataset, last updated April 24, 2026.
| Rank | Model | Creator | Overall | Notes |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 99 | Current overall leader |
| 2 | Gemini 3.1 Pro | Google | 93 | Best value mainstream flagship |
| 3 | GPT-5.4 Pro | OpenAI | 92 | Strongest specialist reasoning/math row |
| 4 | Grok 4.1 | xAI | 90 | Strong broad benchmark profile |
| 5 | GPT-5.5 | OpenAI | 89 | Stronger reasoning row with limited current coverage |
| 6 | GPT-5.3 Codex | OpenAI | 89 | Elite coding-oriented row |
| 7 | Claude Opus 4.6 | Anthropic | 88 | Best writing-first flagship |
| 8 | GPT-5.4 | OpenAI | 88 | Superseded but still broad OpenAI row |
| 9 | Claude Opus 4.7 | Anthropic | 86 | Strong coding/agentic row with low current OfficeQA Pro evidence |
| 10 | Gemini 3 Pro Deep Think | Google | 86 | Strong multimodal and reasoning profile |
The main thing to notice is that the current leaderboard no longer maps cleanly onto a single vendor narrative. Anthropic has the #1 row, Google has the strongest mainstream value row, OpenAI has multiple tightly clustered frontier rows, and DeepSeek, Moonshot, and Z.AI now have serious open-weight entries.
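The clustering is easy to see if you fold the top-10 table down to each creator's best row. A minimal sketch in Python, with the scores hand-copied from the table above (this is illustration, not a BenchLM API):

```python
# Best overall row per creator, from the top-10 table above.
# Scores are hand-copied for illustration, not fetched from BenchLM.
top10 = [
    ("Claude Mythos Preview", "Anthropic", 99),
    ("Gemini 3.1 Pro", "Google", 93),
    ("GPT-5.4 Pro", "OpenAI", 92),
    ("Grok 4.1", "xAI", 90),
    ("GPT-5.5", "OpenAI", 89),
    ("GPT-5.3 Codex", "OpenAI", 89),
    ("Claude Opus 4.6", "Anthropic", 88),
    ("GPT-5.4", "OpenAI", 88),
    ("Claude Opus 4.7", "Anthropic", 86),
    ("Gemini 3 Pro Deep Think", "Google", 86),
]

best = {}
for model, creator, score in top10:
    if score > best.get(creator, ("", 0))[1]:
        best[creator] = (model, score)

for creator, (model, score) in sorted(best.items(), key=lambda kv: -kv[1][1]):
    print(f"{creator:10s} {model:25s} {score}")
# Four different creators land a best row at 90 or above.
```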
| Rank | Model | Coding score |
|---|---|---|
| 1 | Claude Mythos Preview | 100 |
| 2 | Gemini 3.1 Pro | 94.3 |
| 3 | GPT-5.4 Pro | 92.8 |
| 4 | Claude Opus 4.6 | 90.8 |
| 5 | GPT-5.4 | 90.7 |
Coding is still one of the cleanest frontier separators because the main coding benchmarks have not fully collapsed into saturation. The field is very tight at the top, but it is still meaningfully rankable.
| Rank | Model | Agentic score |
|---|---|---|
| 1 | Claude Mythos Preview | 100 |
| 2 | GPT-5.4 | 93.5 |
| 3 | Claude Opus 4.6 | 92.6 |
| 4 | GPT-5.4 Pro | 92.4 |
| 5 | Gemini 3.1 Pro | 87.8 |
The agentic category is one of the best reasons not to overfocus on old academic-style benchmarks. The models that win here are the ones that matter most for real tool use, software interaction, and multi-step workflows.
| Rank | Model | Reasoning score |
|---|---|---|
| 1 | GPT-5.4 Pro | 99.3 |
| 2 | Gemini 3.1 Pro | 97 |
| 3 | GPT-5.3 Codex | 94.7 |
| 4 | GPT-5.4 | 93 |
| 5 | Grok 4.1 | 91.9 |
| Rank | Model | Knowledge score |
|---|---|---|
| 1 | Muse Spark | 100 |
| 2 | Claude Mythos Preview | 98.7 |
| 3 | GPT-5.4 | 97.6 |
| 4 | Gemini 3.1 Pro | 95.6 |
| 5 | Grok 4.1 | 94.7 |
Knowledge is where the benchmark-selection issue matters most. Some rows still look strong because of older tests that are partially saturated. The harder separators now are HLE, GPQA, and MMLU-Pro rather than the older baseline-style knowledge benchmarks.
| Rank | Model | Multimodal score |
|---|---|---|
| 1 | GPT-5.4 Pro | 100 |
| 2 | Gemini 3 Pro Deep Think | 100 |
| 3 | Claude Mythos Preview | 97.8 |
| 4 | Grok 4.1 | 97.5 |
| 5 | GPT-5.1 | 95.8 |
This is one of the most commercially relevant categories in 2026 because real workloads increasingly involve screenshots, documents, charts, and mixed-media contexts rather than only plain text.
The benchmarks that still do useful work are the ones with real frontier spread: HLE, GPQA, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro.
The benchmarks that matter less than they used to are the saturated or legacy rows that mostly act as floor checks rather than frontier separators.
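One rough way to draw that line: look at how much the frontier models disagree on a given benchmark, and treat near-zero spread as saturation. A minimal sketch, assuming hypothetical per-benchmark scores and an arbitrary 3-point threshold (neither is BenchLM's actual method):

```python
# Illustrative sketch only: classify benchmarks by frontier spread.
# The scores below are hypothetical, not BenchLM data, and the
# 3-point threshold is an arbitrary choice for illustration.
frontier_scores = {
    "hard-benchmark-A":   [42.1, 38.5, 31.0, 27.4],  # wide spread
    "legacy-benchmark-B": [92.3, 91.8, 91.5, 90.9],  # near-saturated
}

for bench, scores in frontier_scores.items():
    spread = max(scores) - min(scores)
    role = "frontier separator" if spread >= 3.0 else "floor check"
    print(f"{bench:18s} spread={spread:5.1f} -> {role}")
```

A saturated benchmark still tells you a model cleared the floor; it just stops telling you which frontier model is ahead.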
The open-weight story is now much stronger than it was a year ago.
| Model | Type | Overall |
|---|---|---|
| Gemini 3.1 Pro | Proprietary | 93 |
| GPT-5.4 Pro | Proprietary | 92 |
| Claude Opus 4.6 | Proprietary | 88 |
| GPT-5.4 | Proprietary | 88 |
| DeepSeek V4 Pro (Max) | Open Weight | 87 |
| Kimi K2.6 | Open Weight | 84 |
| GLM-5 (Reasoning) | Open Weight | 83 |
| GLM-5.1 | Open Weight | 83 |
| Qwen3.5 397B (Reasoning) | Open Weight | 79 |
The top open-weight row still trails the top mainstream proprietary tier by 6 points. That is not parity. It is also no longer an afterthought. The difference now is smaller, and in some narrow categories open-weight rows are fully competitive.
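In score terms that gap is simple arithmetic over the table above. A minimal sketch, again with scores hand-copied rather than pulled from any API:

```python
# Gap between the best open-weight row and the mainstream proprietary
# leader, using scores hand-copied from the table above.
open_weight = {
    "DeepSeek V4 Pro (Max)": 87,
    "Kimi K2.6": 84,
    "GLM-5 (Reasoning)": 83,
    "GLM-5.1": 83,
    "Qwen3.5 397B (Reasoning)": 79,
}
proprietary = {
    "Gemini 3.1 Pro": 93,
    "GPT-5.4 Pro": 92,
    "Claude Opus 4.6": 88,
    "GPT-5.4": 88,
}

gap = max(proprietary.values()) - max(open_weight.values())
print(f"open-weight gap to mainstream proprietary leader: {gap} points")  # 6
```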
The 2026 benchmark landscape is broader, tighter, and harder to summarize with one headline winner.
The big change is not just who ranks first. It is that the leaderboard now has multiple credible top-tier stories depending on whether you care about value, specialist depth, interaction quality, or open-weight access.
What is the best AI model in 2026? On BenchLM's current data, Claude Mythos Preview leads overall at 99. Among the broader mainstream frontier rows, Gemini 3.1 Pro leads at 93, followed by GPT-5.4 Pro at 92, Grok 4.1 at 90, and GPT-5.5 at 89.
Which LLM is best for coding in 2026? Claude Mythos Preview leads the current coding category. Among mainstream rows, Gemini 3.1 Pro, GPT-5.4 Pro, Claude Opus 4.6, and GPT-5.4 are all clustered tightly near the top.
Are open-weight models close to proprietary models now? Closer, yes. Equal, no. The best open-weight row is 87 versus 93 for the current mainstream proprietary leader.
Which benchmarks matter most in 2026? HLE, GPQA, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro.
Are older benchmarks like MMLU still useful? Mostly as baseline context. The more meaningful frontier separation now comes from the harder benchmark set.
All benchmark data is from BenchLM's live dataset, current as of April 24, 2026.