State of LLM benchmarks in 2026: current BenchLM rankings, category leaders, benchmark trends, open vs closed performance, and what still matters after the latest scoring changes.
The benchmark picture in April 2026 is different from the one many people still carry in their heads. The old story was simple: one or two headline models sat clearly above the field, older knowledge benchmarks still mattered too much, and open-weight rows were interesting but not yet close. The current data is messier and more useful.
The top of the leaderboard is now fragmented. Claude Mythos Preview sits at 99 overall, but the broader mainstream frontier cluster is tighter: Gemini 3.1 Pro and GPT-5.4 are tied at 94, with Claude Opus 4.6 and GPT-5.4 Pro at 92. Open-weight models have moved up too. GLM-5 (Reasoning) is at 85, GLM-5.1 at 84, and Qwen3.5 397B (Reasoning) at 81.
All data below reflects BenchLM's live dataset, last updated April 8, 2026.
| Rank | Model | Creator | Overall | Notes |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 99 | Current overall leader |
| 2 | Gemini 3.1 Pro | Google | 94 | Best value mainstream flagship |
| 3 | GPT-5.4 | OpenAI | 94 | Strongest broad OpenAI default |
| 4 | Claude Opus 4.6 | Anthropic | 92 | Best writing-first flagship |
| 5 | GPT-5.4 Pro | OpenAI | 92 | Strongest specialist reasoning/math row |
| 6 | GPT-5.3 Codex | OpenAI | 89 | Elite coding-oriented row |
| 7 | Gemini 3 Pro Deep Think | Google | 87 | Strong multimodal and reasoning profile |
| 8 | Claude Sonnet 4.6 | Anthropic | 86 | Broad, cheaper Anthropic flagship lane |
| 9 | GLM-5 (Reasoning) | Z.AI | 85 | Best open-weight overall row |
| 10 | GLM-5.1 | Z.AI | 84 | Strong follow-on open-weight row |
The main thing to notice is that the current leaderboard no longer maps cleanly onto a single vendor narrative. Anthropic has the #1 row, Google and OpenAI are tied just behind, and Z.AI now has the strongest open-weight entries.
| Rank | Model | Coding score |
|---|---|---|
| 1 | Claude Mythos Preview | 100 |
| 2 | Gemini 3.1 Pro | 94.3 |
| 3 | GPT-5.4 Pro | 92.8 |
| 4 | Claude Opus 4.6 | 90.8 |
| 5 | GPT-5.4 | 90.7 |
Coding is still one of the cleanest frontier separators because the main coding benchmarks have not fully collapsed into saturation. The field is very tight at the top, but it is still meaningfully rankable.
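That tightness is easy to check directly from the table above. A minimal sketch, using only the coding scores quoted in this report (the variable names are illustrative, not part of BenchLM's data):

```python
# Top-5 coding scores copied from the table above (BenchLM, April 2026 snapshot).
coding = {
    "Claude Mythos Preview": 100.0,
    "Gemini 3.1 Pro": 94.3,
    "GPT-5.4 Pro": 92.8,
    "Claude Opus 4.6": 90.8,
    "GPT-5.4": 90.7,
}

# Spread across the full top five, preview leader included.
spread = max(coding.values()) - min(coding.values())

# Spread across the mainstream rows only, excluding the preview leader.
mainstream = {m: s for m, s in coding.items() if m != "Claude Mythos Preview"}
mainstream_spread = max(mainstream.values()) - min(mainstream.values())

print(f"top-5 spread: {spread:.1f}")              # 9.3 points
print(f"mainstream spread: {mainstream_spread:.1f}")  # 3.6 points
```

A 3.6-point spread across four mainstream flagships is small enough that a single benchmark swap could reorder them, which is exactly why coding remains rankable rather than saturated.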
| Rank | Model | Agentic score |
|---|---|---|
| 1 | Claude Mythos Preview | 100 |
| 2 | GPT-5.4 | 93.5 |
| 3 | Claude Opus 4.6 | 92.6 |
| 4 | GPT-5.4 Pro | 92.4 |
| 5 | Gemini 3.1 Pro | 87.8 |
The agentic category remains one of the best reasons not to overfocus on old academic-style benchmarks. The models that win here are the ones that matter most for real tool use, software interaction, and multi-step workflows.
| Rank | Model | Reasoning score |
|---|---|---|
| 1 | GPT-5.4 Pro | 99.3 |
| 2 | Gemini 3.1 Pro | 97 |
| 3 | GPT-5.3 Codex | 94.7 |
| 4 | GPT-5.4 | 93 |
| 5 | Grok 4.1 | 91.9 |
| Rank | Model | Knowledge score |
|---|---|---|
| 1 | Muse Spark | 100 |
| 2 | Claude Mythos Preview | 98.7 |
| 3 | GPT-5.4 | 97.6 |
| 4 | Gemini 3.1 Pro | 95.6 |
| 5 | Grok 4.1 | 94.7 |
Knowledge is where the benchmark-selection issue matters most. Some rows still look strong because of older tests that are partially saturated. The harder separators now are HLE, GPQA, and MMLU-Pro rather than the older baseline-style knowledge benchmarks.
| Rank | Model | Multimodal score |
|---|---|---|
| 1 | GPT-5.4 Pro | 100 |
| 2 | Gemini 3 Pro Deep Think | 100 |
| 3 | Claude Mythos Preview | 97.8 |
| 4 | Grok 4.1 | 97.5 |
| 5 | GPT-5.1 | 95.8 |
This is one of the most commercially relevant categories in 2026 because real workloads increasingly involve screenshots, documents, charts, and mixed-media contexts rather than only plain text.
The benchmarks that still do useful work are the ones with real frontier spread: HLE, GPQA, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro.
The benchmarks that matter less than they used to are the saturated or legacy rows that mostly act as floor checks rather than frontier separators.
The open-weight story is now much stronger than it was a year ago.
| Model | Type | Overall |
|---|---|---|
| Gemini 3.1 Pro | Proprietary | 94 |
| GPT-5.4 | Proprietary | 94 |
| Claude Opus 4.6 | Proprietary | 92 |
| GLM-5 (Reasoning) | Open Weight | 85 |
| GLM-5.1 | Open Weight | 84 |
| Qwen3.5 397B (Reasoning) | Open Weight | 81 |
The top open-weight row still trails the top proprietary tier by 9 points. That is not parity. It is also no longer an afterthought. The difference now is smaller, and in some narrow categories open-weight rows are fully competitive.
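The 9-point figure falls straight out of the open-vs-closed table above. A quick sketch, using only the scores quoted in this report (the `scores` structure is illustrative, not BenchLM's actual data format):

```python
# Overall scores copied from the open-vs-closed table above (BenchLM, April 2026).
scores = {
    "Gemini 3.1 Pro": ("proprietary", 94),
    "GPT-5.4": ("proprietary", 94),
    "Claude Opus 4.6": ("proprietary", 92),
    "GLM-5 (Reasoning)": ("open", 85),
    "GLM-5.1": ("open", 84),
    "Qwen3.5 397B (Reasoning)": ("open", 81),
}

# Find the best-scoring row in each camp.
best = {}
for model, (kind, score) in scores.items():
    if score > best.get(kind, ("", -1))[1]:
        best[kind] = (model, score)

gap = best["proprietary"][1] - best["open"][1]
print(best["open"][0], "trails by", gap)  # GLM-5 (Reasoning) trails by 9
```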
The 2026 benchmark landscape is broader, tighter, and harder to summarize with one headline winner.
The big change is not just who ranks first. It is that the leaderboard now has multiple credible top-tier stories depending on whether you care about value, specialist depth, interaction quality, or open-weight access.
What is the best AI model in 2026? On BenchLM's current data, Claude Mythos Preview leads overall at 99. Among the broader mainstream frontier rows, Gemini 3.1 Pro and GPT-5.4 are tied at 94.
Which LLM is best for coding in 2026? Claude Mythos Preview leads the current coding category. Among mainstream rows, Gemini 3.1 Pro, GPT-5.4 Pro, Claude Opus 4.6, and GPT-5.4 are all clustered tightly near the top.
Are open-weight models close to proprietary models now? Closer, yes. Equal, no. The best open-weight row is 85 versus 94 for the current proprietary leaders.
Which benchmarks matter most in 2026? HLE, GPQA, MMLU-Pro, SWE-bench Pro, SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro.
Are older benchmarks like MMLU still useful? Mostly as baseline context. The more meaningful frontier separation now comes from the harder benchmark set.