
State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed

State of LLM benchmarks in 2026: top AI model rankings, category leaders, benchmark trends, open vs closed performance, pricing context, and methodology from BenchLM.

Glevd · March 22, 2026 · 17 min read


Gemini 3.1 Pro is the current #1 model on BenchLM's overall leaderboard with a score of 83. GPT-5.4 is #2 at 80. Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.4 Pro cluster behind them at 76. That is the headline. The more important point is why the ranking looks like this: in 2026, the best models are no longer separated by one generic idea of "intelligence." They are separated by coverage, benchmark mix, and category-specific strengths.

The benchmark landscape also changed. MMLU is effectively saturated. HumanEval is close to solved at the frontier. The benchmarks that still matter are the ones that force spread: HLE for hard knowledge, MMLU-Pro and GPQA for nontrivial factual reasoning, SWE-bench Pro and LiveCodeBench for coding, Terminal-Bench 2.0 and OSWorld-Verified for agents, and MMMU-Pro plus OfficeQA Pro for multimodal work. That shift matters more than any single model launch.

All data below reflects the current BenchLM dataset, last updated March 18, 2026, covering 135 models.

Key findings

  • Gemini 3.1 Pro is the most balanced frontier model right now. It ranks #1 overall at 83 because it is strong across all eight tracked categories, not because it dominates a single benchmark family.
  • GPT-5.4 Pro is the strongest narrow specialist, not the strongest overall model. It leads reasoning (95), knowledge (96.3), instruction following (97), and math (98.3), but it has much narrower published coverage than Gemini 3.1 Pro or GPT-5.4.
  • Coding and agentic benchmarks still create the most useful separation. SWE-bench Pro has a 14-point gap from first to fifth place. Terminal-Bench 2.0 has a 15.7-point gap from first to fifth. MMLU has a 0-point gap across the top five.
  • Open-weight models are now credible top-20 entrants. Qwen2.5-1M ranks #9 overall. DeepSeek V3.2 (Thinking) ranks #16. DeepSeek Coder 2.0 ranks #17. That is not frontier parity, but it is real competitive pressure.
  • No single model owns every category. Gemini 3.1 Pro leads multimodal. GPT-5.4 leads OfficeQA Pro and OSWorld-Verified. Claude Opus 4.6 leads HLE and ties for multilingual leadership. GPT-5.4 Pro leads reasoning and math. The frontier is fragmented.
  • Coverage quality is now part of the ranking story. Models with exceptional but sparse benchmark rows can look unbeatable in narrow views and still lose the overall table because BenchLM rewards cross-category evidence.
  • Price-performance matters more than raw peak performance for most teams. Gemini 3.1 Pro at $1.25 / $5 and GPT-5.4 at $2.50 / $15 are far easier to justify for broad production use than GPT-5.4 Pro at $30 / $180 or Claude Opus 4.6 at $15 / $75.

The overall leaderboard

BenchLM's overall score weights eight categories: agentic (22%), coding (20%), reasoning (17%), multimodal grounded (12%), knowledge (12%), multilingual (7%), instruction following (5%), and math (5%). That weighting matters. A model with one elite skill and missing category coverage can lose to a model that is merely excellent everywhere.

Top 20 models overall

| Rank | Model | Creator | Overall | Strongest category | Notes |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 83 | Multimodal / Math | Broadest elite all-rounder with full cross-category coverage |
| 2 | GPT-5.4 | OpenAI | 80 | Instruction following / Reasoning | Broad coverage, strongest all-purpose OpenAI general model |
| 3 | Claude Opus 4.6 | Anthropic | 76 | Math / Multilingual | Excellent depth, especially HLE and multilingual |
| 4 | Claude Sonnet 4.6 | Anthropic | 76 | Math / Multimodal | Broader coverage than Opus, weaker coding profile |
| 5 | GPT-5.4 Pro | OpenAI | 76 | Math / Knowledge / Reasoning | Best specialist profile, but narrower published coverage |
| 6 | Gemini 3 Pro Deep Think | Google | 70 | Math / Agentic | Strong on BrowseComp, but not a complete all-category row |
| 7 | GPT-5.3 Codex | OpenAI | 70 | Math / Reasoning / Coding | One of the strongest coding-focused models on BenchLM |
| 8 | o3-mini | OpenAI | 70 | Instruction following / Reasoning | Strong reasoning package with weaker coding depth |
| 9 | Qwen2.5-1M | Alibaba | 67 | Math / Reasoning | Best open-weight overall model on BenchLM right now |
| 10 | Grok 4 | xAI | 67 | Math / Reasoning | Strong coding and multimodal row, weaker agentic profile |
| 11 | GPT-5.2 | OpenAI | 67 | Math / Knowledge | Strong older OpenAI row, weaker agentic coverage |
| 12 | Kimi K2.5 (Reasoning) | Moonshot AI | 67 | Multilingual / Math | Excellent coding and multilingual signals, sparser breadth |
| 13 | GPT-4.1 | OpenAI | 67 | Instruction following | Broadly competent, no longer frontier-leading in coding |
| 14 | o1 | OpenAI | 67 | Instruction following | Reasoning-first model with weak coding relative to 2026 leaders |
| 15 | Grok 4.1 | xAI | 67 | Math / Knowledge / Reasoning | Very strong partial row, but far narrower benchmark footprint |
| 16 | DeepSeek V3.2 (Thinking) | DeepSeek | 66 | Math / Multilingual | Best open-weight reasoning-agentic compromise from DeepSeek |
| 17 | DeepSeek Coder 2.0 | DeepSeek | 66 | Instruction following / Math | Strong open coding option, still well behind top closed models overall |
| 18 | o3 | OpenAI | 63 | Math | Competitive reasoning line, weak coding row |
| 19 | Nemotron 3 Ultra 500B | NVIDIA | 63 | Instruction following | Open-weight scale play with weak coding depth |
| 20 | Qwen3.5 397B (Reasoning) | Alibaba | 63 | Math / Reasoning | Strong narrow reasoning profile, less cross-category evidence |

Why Gemini 3.1 Pro is #1

Gemini 3.1 Pro does not win because it has the single highest ceiling anywhere. It wins because it almost never drops.

Its category profile is unusually balanced:

  • Agentic: 76.1
  • Coding: 72.0
  • Reasoning: 88.3
  • Multimodal grounded: 95.0
  • Knowledge: 80.7
  • Multilingual: 94.1
  • Instruction following: 95.0
  • Math: 97.1

That is the kind of row that survives any reasonable weighting system. It also has benchmark coverage across all eight categories, which matters when overall scores are meant to represent real deployment breadth rather than a lab highlight reel.
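For readers who want to check the arithmetic, the sketch below reproduces an overall score of roughly 83 from the category row above, assuming BenchLM applies its published category weights as a straight weighted average (the exact aggregation and rounding may differ).

```python
# Minimal sketch: reproduce an overall score from a published category row.
# Assumes BenchLM's overall score is a straight weighted average of its eight
# category scores; the actual aggregation and rounding may differ.

WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.17,
    "multimodal_grounded": 0.12,
    "knowledge": 0.12,
    "multilingual": 0.07,
    "instruction_following": 0.05,
    "math": 0.05,
}

gemini_3_1_pro = {
    "agentic": 76.1,
    "coding": 72.0,
    "reasoning": 88.3,
    "multimodal_grounded": 95.0,
    "knowledge": 80.7,
    "multilingual": 94.1,
    "instruction_following": 95.0,
    "math": 97.1,
}

overall = sum(weight * gemini_3_1_pro[category] for category, weight in WEIGHTS.items())
print(round(overall, 1))  # 83.4, consistent with the published overall score of 83
```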

The strongest part of the Gemini case is multimodal plus long-context utility. It leads the weighted multimodal category at 95, ties the top MMMU-Pro score at 95, and remains competitive on OfficeQA Pro at 95. It also holds up unusually well in multilingual and math. If you want a single general-purpose model with few obvious holes, Gemini 3.1 Pro has the cleanest argument on current data.

Why GPT-5.4 is still the other real contender

GPT-5.4 ranks #2 overall at 80 and is probably the strongest "safe default" alternative to Gemini 3.1 Pro.

Its broad row is difficult to dismiss:

  • Agentic: 77.0
  • Coding: 72.8
  • Reasoning: 89.9
  • Multimodal grounded: 87.9
  • Knowledge: 83.1
  • Multilingual: 94.0
  • Instruction following: 96.0

It also leads some of the benchmarks that matter most in actual product workflows. GPT-5.4 is #1 on OfficeQA Pro at 96 and #1 on OSWorld-Verified at 75. That matters because those are not toy tasks. They reward models that can operate over messy interfaces, documents, screenshots, and workflow-like inputs.

The only obvious hole is math coverage. GPT-5.4 has no current math row in BenchLM's dataset, which keeps it from challenging GPT-5.4 Pro on pure specialist strength. But for teams that care more about breadth than benchmark perfection in a narrow category, GPT-5.4 remains a very strong all-around production option.

The specialist tier: GPT-5.4 Pro, GPT-5.3 Codex, and Claude

GPT-5.4 Pro is the strongest model on several individual category tables:

  • Reasoning: 95.0
  • Knowledge: 96.3
  • Instruction following: 97.0
  • Math: 98.3
  • SWE-bench Verified: 86

If you only looked at these rows, you would assume GPT-5.4 Pro is the clear overall winner. It is not. The reason is simple: the current published row is narrower. GPT-5.4 Pro has no current agentic, multimodal, or multilingual category eligibility in BenchLM's data. BenchLM's overall leaderboard rewards demonstrated breadth. That is the correct choice for a production ranking, even if it is less flattering to narrow specialist leaders.

GPT-5.3 Codex is the same story in a more coding-centric form. It ranks #7 overall, but it remains one of the strongest coding models in the dataset:

  • Coding category: 85
  • SWE-bench Verified: 85
  • Reasoning category: 93
  • Math category: 97.6

For teams choosing an API for coding agents or software tooling, GPT-5.3 Codex is arguably more interesting than its overall rank suggests.

Anthropic's frontier is split between two models:

  • Claude Opus 4.6 brings the stronger deep-knowledge and multilingual profile. It leads HLE at 53 and ties for the best multilingual category score at 96.
  • Claude Sonnet 4.6 is the broader all-purpose Anthropic row, especially strong in multimodal with a weighted score of 91.9 and a tied #1 MMMU-Pro score of 95.

The important takeaway is that the frontier is no longer winner-take-all. Gemini leads the most balanced overall package. GPT has the strongest specialist lines in several categories. Anthropic still owns some of the hardest knowledge and multilingual rows. Model selection is now more about use case fit than abstract prestige.

Category leaders

The overall leaderboard is useful, but it hides where the real separation lives. These are the category tables that matter most.

Coding

BenchLM's coding category gives the most weight to the benchmarks that still separate frontier models, such as SWE-bench Pro and LiveCodeBench, rather than to saturated legacy tests.

That weighting is important because it reflects what changed in 2026. HumanEval still exists, but it is no longer the main signal for frontier coding models.

Top coding category scores:

| Rank | Model | Coding score |
|---|---|---|
| 1 | GPT-5.4 Pro | 86.0 |
| 2 | GPT-5.3 Codex | 85.0 |
| 3 | Kimi K2.5 (Reasoning) | 82.9 |
| 4 | Kimi K2.5 | 82.9 |
| 5 | Claude Opus 4.5 | 80.9 |

The benchmark-level picture is more fragmented:

  • SWE-bench Verified leader: GPT-5.4 Pro (86)
  • SWE-bench Pro leader: Gemini 3.1 Pro (72)
  • LiveCodeBench leader: Kimi K2.5 (Reasoning) and Kimi K2.5 (85)

That is a good example of why single-benchmark arguments are weak. If you care about repository bug fixing, GPT-5.4 Pro and GPT-5.3 Codex look best. If you care about fresh coding tasks, Kimi matters more. If you want the broadest balanced general-purpose model that still holds up on coding, Gemini 3.1 Pro remains competitive.

Agentic

Agentic is now the heaviest category in BenchLM's formula at 22%, ahead of coding. That is a reasonable reflection of where the market is going.

Top agentic category scores:

| Rank | Model | Agentic score |
|---|---|---|
| 1 | MiMo-V2-Pro | 86.7 |
| 2 | Gemini 3 Pro Deep Think | 78.8 |
| 3 | GPT-5.4 | 77.0 |
| 4 | Gemini 3.1 Pro | 76.1 |
| 5 | Claude Opus 4.6 | 72.6 |

Two caveats matter here:

  1. MiMo-V2-Pro is not a clean overall winner story. It has only three benchmark results in BenchLM's current dataset, so its #1 agentic placement should be read as a narrow signal, not a full model judgment.
  2. The benchmark leaders split by task type. MiMo-V2-Pro leads Terminal-Bench 2.0 at 86.7. Gemini 3 Pro Deep Think leads BrowseComp at 87. GPT-5.4 leads OSWorld-Verified at 75.

The bigger story is that agentic benchmarks still have real spread. Terminal-Bench 2.0 has a 15.7-point gap between first and fifth. BrowseComp has a 10-point gap. These are not saturated tests.

Reasoning and knowledge

Reasoning and knowledge are where the old and new benchmark worlds collide.

Top reasoning category scores:

| Rank | Model | Reasoning score |
|---|---|---|
| 1 | GPT-5.4 Pro | 95.0 |
| 2 | GPT-5.3 Codex | 93.0 |
| 3 | Grok 4.1 | 93.0 |
| 4 | GPT-5.4 | 89.9 |
| 5 | Gemini 3.1 Pro | 88.3 |

Top knowledge category scores:

| Rank | Model | Knowledge score |
|---|---|---|
| 1 | GPT-5.4 Pro | 96.3 |
| 2 | Grok 4.1 | 95.6 |
| 3 | Claude Opus 4.5 | 95.0 |
| 4 | GPT-5.2-Codex | 95.0 |
| 5 | Gemini 3 Pro | 95.0 |

At first glance those knowledge scores look tightly clustered, and that is exactly the point. Some older knowledge benchmarks are no longer doing useful work. On MMLU, the top five models all score 99. That benchmark is now a floor check, not a frontier separator.

The better 2026 signals are:

  • MMLU-Pro: top 94, fifth 90
  • GPQA: top 99, fifth 92.8
  • HLE: top 53, fifth 40

HLE is the clearest knowledge separator right now. Claude Opus 4.6 leads it at 53, ahead of Claude Sonnet 4.6 at 49 and GPT-5.4 at 48. If you want one benchmark that still tells you something meaningful about the hard edge of frontier knowledge, HLE has the strongest argument.

Multimodal grounded

This is one of the biggest strategic categories in 2026 because enterprise and agent workflows increasingly involve screenshots, spreadsheets, documents, dashboards, and mixed inputs rather than plain text.

Top multimodal category scores:

| Rank | Model | Multimodal score |
|---|---|---|
| 1 | Gemini 3.1 Pro | 95.0 |
| 2 | Claude Sonnet 4.6 | 91.9 |
| 3 | GPT-5.4 | 87.9 |
| 4 | Gemini 3 Pro | 81.0 |
| 5 | Claude 4 Sonnet | 79.7 |

The benchmark split is useful:

  • MMMU-Pro leaders: Gemini 3.1 Pro (95) and Claude Sonnet 4.6 (95)
  • OfficeQA Pro leader: GPT-5.4 (96)

That means the multimodal story is not just "Gemini wins." Gemini wins the weighted category because it is elite on both tests. But GPT-5.4 deserves explicit credit for being best on the more office-workflow-shaped benchmark.

Open-weight vs closed: the gap is smaller, but still real

Open-weight models are now good enough that the conversation changed. They are no longer interesting only as cheap substitutes. Some are credible choices.

The strongest open-weight overall rows on BenchLM right now are:

| Rank | Model | Overall | Notes |
|---|---|---|---|
| 1 | Qwen2.5-1M | 67 | Best open-weight overall model on BenchLM |
| 2 | DeepSeek V3.2 (Thinking) | 66 | Strong reasoning-agentic compromise |
| 3 | DeepSeek Coder 2.0 | 66 | Best open coding profile in the top open-weight tier |
| 4 | Nemotron 3 Ultra 500B | 63 | Strong scale and instruction-following profile |
| 5 | Qwen3.5 397B (Reasoning) | 63 | Narrow reasoning strength, thinner breadth |

This is real progress. Qwen2.5-1M ranking #9 overall means the open-weight tier is no longer shut out of the frontier conversation.

But the gap is still visible in three places:

  • Coding depth: Qwen2.5-1M's coding score is 44.9. DeepSeek Coder 2.0 improves that to 52.7, but both trail the 80+ closed-model leaders.
  • Multimodal quality: the top closed models are still materially ahead on MMMU-Pro and OfficeQA Pro.
  • Broad category consistency: the best closed models stack high scores across more categories at once.

The practical version is simple: if you need the absolute best all-purpose model, you still end up in the proprietary tier. If you need a strong open-weight system, the open tier is now good enough that the tradeoff is technical, not symbolic.

The benchmarks that matter now

The clearest shift in 2026 is not just which model is first. It is which benchmarks are still worth trusting as primary frontier signals.

Benchmark spread snapshot

| Benchmark | Top score | 5th place | 10th place | What it tells you |
|---|---|---|---|---|
| MMLU | 99 | 99 | 90.8 | Effectively saturated at the top |
| MMLU-Pro | 94 | 90 | 84 | Still useful for broad knowledge separation |
| HLE | 53 | 40 | 18.8 | Strongest hard-knowledge spread |
| SWE-bench Pro | 72 | 58 | 55 | Real coding separation remains |
| LiveCodeBench | 85 | 80 | 63.6 | Fresh coding tasks still create a long tail |
| Terminal-Bench 2.0 | 86.7 | 71 | 63 | One of the clearest agentic separators |
| BrowseComp | 87 | 77 | 72 | Strong research-agent benchmark |
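One way to make "spread" concrete is to compute the first-to-fifth and first-to-tenth gaps for each benchmark and flag the ones where the top of the table has collapsed. Here is a minimal sketch using the numbers from the snapshot above; the 4-point saturation threshold is an illustrative choice, not a BenchLM definition.

```python
# Flag benchmarks whose top-of-table spread has collapsed, using the
# (top, 5th place, 10th place) scores from the spread snapshot above.
# The 4-point saturation threshold is illustrative, not a BenchLM definition.

spreads = {
    "MMLU":               (99.0, 99.0, 90.8),
    "MMLU-Pro":           (94.0, 90.0, 84.0),
    "HLE":                (53.0, 40.0, 18.8),
    "SWE-bench Pro":      (72.0, 58.0, 55.0),
    "LiveCodeBench":      (85.0, 80.0, 63.6),
    "Terminal-Bench 2.0": (86.7, 71.0, 63.0),
    "BrowseComp":         (87.0, 77.0, 72.0),
}

SATURATION_THRESHOLD = 4.0  # first-to-fifth gaps below this count as saturated

for name, (top, fifth, tenth) in spreads.items():
    gap_to_fifth = top - fifth
    gap_to_tenth = top - tenth
    status = "saturated at the top" if gap_to_fifth < SATURATION_THRESHOLD else "still separating"
    print(f"{name}: 1st-5th gap {gap_to_fifth:.1f}, 1st-10th gap {gap_to_tenth:.1f} -> {status}")
```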

Weakening signals

MMLU has effectively stopped differentiating elite knowledge models: the top five scores are 99, 99, 99, 99, 99.

HumanEval still provides some signal, but far less than it used to: the top five scores are 99, 95, 95, 95, 93. That is better than MMLU, but still weak compared with harder coding tests.

Stronger frontier signals

HLE remains one of the best high-end knowledge filters because the spread is still large: top score 53, fifth score 40, tenth score 18.8.

SWE-bench Pro remains more useful than legacy coding tests because the spread is still large: top score 72, fifth 58, tenth 55.

LiveCodeBench is valuable because it uses fresher tasks and still shows a meaningful long tail: top score 85, tenth 63.6.

Terminal-Bench 2.0 is one of the clearest non-saturated agentic benchmarks in the set: top score 86.7, fifth 71, tenth 63.

BrowseComp is a strong research-agent benchmark with a meaningful spread: top 87, fifth 77.

OSWorld-Verified is slightly tighter at the top, but still useful because it measures real interface work rather than stylized QA.

The broader point is that benchmark selection has become a methodological problem. If your ranking still leans heavily on MMLU and HumanEval, it will systematically overstate certainty and understate real product differences.

Price versus performance

The frontier no longer has one obvious value story. It has tiers.

Broad production tier

  • Gemini 3.1 Pro: $1.25 / $5
  • GPT-5.4: $2.50 / $15
  • Claude Sonnet 4.6: $3 / $15

These are the models that make the most sense for teams that need broad capability without paying flagship premiums.

Specialist premium tier

  • Claude Opus 4.6: $15 / $75
  • GPT-5.4 Pro: $30 / $180

These prices are hard to justify unless you specifically need the incremental specialist gains. GPT-5.4 Pro does earn its place on reasoning, knowledge, instruction following, and math. Claude Opus 4.6 does earn its place on HLE and multilingual. But most teams are not buying a benchmark trophy. They are buying a model that has to survive production economics.

Coding-specific value case

  • GPT-5.3 Codex: $2.50 / $10

This remains one of the cleanest price-performance stories in the whole market. It is a top-tier coding model by BenchLM's weighted coding view, while staying far cheaper than the flagship specialist tier.
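To make the tiers concrete, it helps to cost out a single workload against the list prices above. Below is a minimal sketch, assuming the listed figures are USD per million input and output tokens and using an illustrative workload of 2,000 input and 600 output tokens per request at one million requests per month; the workload numbers are assumptions, not BenchLM data.

```python
# Estimate per-request and monthly spend for a hypothetical workload across the
# pricing tiers above. Assumes the listed prices are USD per 1M input / output
# tokens; the token counts and request volume are illustrative assumptions.

PRICES_PER_1M_TOKENS = {  # model: (input price, output price)
    "Gemini 3.1 Pro":    (1.25, 5.00),
    "GPT-5.4":           (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.3 Codex":     (2.50, 10.00),
    "Claude Opus 4.6":   (15.00, 75.00),
    "GPT-5.4 Pro":       (30.00, 180.00),
}

INPUT_TOKENS_PER_REQUEST = 2_000   # assumed
OUTPUT_TOKENS_PER_REQUEST = 600    # assumed
REQUESTS_PER_MONTH = 1_000_000     # assumed

for model, (input_price, output_price) in PRICES_PER_1M_TOKENS.items():
    per_request = (INPUT_TOKENS_PER_REQUEST * input_price
                   + OUTPUT_TOKENS_PER_REQUEST * output_price) / 1_000_000
    monthly = per_request * REQUESTS_PER_MONTH
    print(f"{model}: ${per_request:.4f} per request, ~${monthly:,.0f} per month")
```

Under those assumptions, the same traffic costs roughly $5,500 a month on Gemini 3.1 Pro and roughly $168,000 a month on GPT-5.4 Pro, which is the economic argument behind the tier split above.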

What changed from 2025 to 2026

The biggest change is not that one lab pulled ahead forever. It is that the benchmark stack got harsher.

2025 framing vs 2026 framing

| Area | 2025 emphasis | 2026 emphasis | Why it matters |
|---|---|---|---|
| Knowledge ranking | MMLU-heavy discussions | MMLU-Pro, GPQA, HLE | Harder tests still create spread |
| Coding quality | HumanEval and legacy SWE-bench citations | SWE-bench Pro and LiveCodeBench | Better match for real software work |
| Agent capability | Demo-driven claims | Terminal-Bench 2.0, BrowseComp, OSWorld-Verified | Action-based evaluation is harder to fake |
| Multimodal strength | Vision as a side feature | MMMU-Pro and OfficeQA Pro | Grounded document and UI tasks now matter |
| Leaderboard logic | Best single score wins headlines | Breadth and coverage matter more | Sparse rows should not decide the full table |

Compared with the older evaluation mix that centered more attention on MMLU, HumanEval, and broad chat impressions, 2026 is much more defined by harder knowledge tests (MMLU-Pro, GPQA, HLE), repository-scale and fresh coding benchmarks, action-based agentic evaluation, grounded multimodal document and UI tasks, and demonstrated breadth of coverage.

That shift makes the leaderboard harder to game and harder to summarize lazily. It also explains why some models look better in current serious ranking systems than they did in older, more benchmark-saturated discussions.

The second change is that open-weight systems are now close enough to matter strategically, even when they are not yet winning overall. The top open models have entered the main table. They are no longer living in a separate hobbyist category.

The third change is that coverage matters more than ever. In a world where model labs selectively publish benchmark rows, the best ranking system is not the one that rewards the loudest claim. It is the one that discounts incomplete evidence.

Methodology

This report uses BenchLM's current public ranking system and benchmark dataset.

Data scope

  • Dataset last updated: March 18, 2026
  • Total tracked models: 135
  • Overall score uses eight categories
  • Models need sufficient published benchmark evidence to rank meaningfully across the site

Overall category weights

  • Agentic: 22%
  • Coding: 20%
  • Reasoning: 17%
  • Multimodal grounded: 12%
  • Knowledge: 12%
  • Multilingual: 7%
  • Instruction following: 5%
  • Math: 5%

Examples of within-category weighting

Within each category, the benchmarks that still create real separation carry more influence than saturated legacy tests: in coding, for example, SWE-bench Pro and LiveCodeBench matter more than HumanEval, and in knowledge, HLE and MMLU-Pro matter more than MMLU.

Why sparse coverage matters

Some models have outstanding narrow rows with limited total coverage. GPT-5.4 Pro is the best example among current frontier proprietary models. MiMo-V2-Pro is the clearest example in agentic. These are real signals, but they are not enough by themselves to justify an undisputed overall #1 position. BenchLM's leaderboard is intentionally conservative here.

That is the right editorial stance for a source-of-record product. If the data is partial, the claim should also be partial.
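BenchLM does not spell out here exactly how missing category rows feed into the overall number, but the tension is easy to illustrate. The sketch below contrasts two common aggregation choices, renormalizing over covered categories versus scoring missing categories with a low prior; the sparse row uses GPT-5.4 Pro's published narrow category scores purely as an example, and neither output is BenchLM's actual figure for any model.

```python
# Contrast two common ways to aggregate a model with missing category rows.
# Both strategies are illustrative; BenchLM's actual treatment is not specified
# here, and neither output below is a real BenchLM score.

WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "multimodal_grounded": 0.12,
    "knowledge": 0.12, "multilingual": 0.07, "instruction_following": 0.05, "math": 0.05,
}

def renormalized(scores):
    """Average over covered categories only, reweighting to sum to 1 (flatters sparse rows)."""
    total_weight = sum(WEIGHTS[c] for c in scores)
    return sum(WEIGHTS[c] * s for c, s in scores.items()) / total_weight

def conservative(scores, prior=0.0):
    """Score missing categories at a low prior (rewards demonstrated breadth)."""
    return sum(WEIGHTS[c] * scores.get(c, prior) for c in WEIGHTS)

# Sparse specialist row: elite where it reports, absent elsewhere
# (values echo GPT-5.4 Pro's published narrow scores, used only as an example).
sparse_row = {"reasoning": 95.0, "knowledge": 96.3, "instruction_following": 97.0, "math": 98.3}

print(f"{renormalized(sparse_row):.1f}")   # ~96.1: looks unbeatable in the narrow view
print(f"{conservative(sparse_row):.1f}")   # ~37.5: collapses once missing coverage counts
```

GPT-5.4 Pro's actual published overall is 76, between those two extremes, which is consistent with a system that credits narrow excellence while still discounting incomplete evidence.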

Bottom line

If you want the most balanced general model in BenchLM's current data, pick Gemini 3.1 Pro.

If you want the strongest all-purpose OpenAI model with broad product usefulness, pick GPT-5.4.

If you want the strongest narrow specialist row on reasoning, knowledge, and math, look at GPT-5.4 Pro, but read the coverage caveat.

If you want the strongest coding-focused value play, look at GPT-5.3 Codex.

If you want the best open-weight all-rounder, start with Qwen2.5-1M.

The larger conclusion is that "best model" is no longer a serious question without a benchmark context. In 2026, the meaningful question is: best model for which task, under which cost constraints, with how much published evidence?

See the full leaderboard · Coding rankings · Reasoning rankings · Knowledge rankings · Agentic rankings · Pricing · What benchmarks actually measure · Are AI benchmarks reliable?


Frequently asked questions

What is the best AI model in 2026?
On BenchLM's current data, Gemini 3.1 Pro is #1 overall with a score of 83. It wins because it is strong across every major category rather than dominating only one benchmark family.

Which LLM is best for coding in 2026?
The weighted coding category is led by GPT-5.4 Pro at 86, followed by GPT-5.3 Codex at 85. But the benchmark-level answer depends on what you care about: GPT-5.4 Pro leads SWE-bench Verified, Gemini 3.1 Pro leads SWE-bench Pro, and Kimi K2.5 leads LiveCodeBench.

Are open-weight models close to proprietary models in 2026?
Closer, yes. Equal, no. Qwen2.5-1M ranks #9 overall and DeepSeek V3.2 (Thinking) ranks #16, which means the open-weight tier is now competitive enough to matter in serious model selection. The remaining gap is largest in coding depth, multimodal quality, and broad all-category consistency.

Which benchmarks matter most in 2026?
The strongest current separators are HLE, MMLU-Pro, GPQA, SWE-bench Pro, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro. MMLU and HumanEval still provide context, but they are weaker primary ranking signals now.

Are older benchmarks like MMLU still useful?
Only as a baseline. MMLU no longer separates top frontier models in a meaningful way because the best systems all cluster at 99.

How does BenchLM rank models?
BenchLM uses weighted category scores rather than a flat average. Agentic and coding carry the most weight, and within each category the benchmarks that still produce real separation carry more influence than saturated legacy tests.


All data sourced from BenchLM.ai. Dataset last updated March 18, 2026.
