State of LLM benchmarks in 2026: top AI model rankings, category leaders, benchmark trends, open vs closed performance, pricing context, and methodology from BenchLM.
Gemini 3.1 Pro is the current #1 model on BenchLM's overall leaderboard with a score of 83. GPT-5.4 is #2 at 80. Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.4 Pro cluster behind them at 76. That is the headline. The more important point is why the ranking looks like this: in 2026, the best models are no longer separated by one generic idea of "intelligence." They are separated by coverage, benchmark mix, and category-specific strengths.
The benchmark landscape also changed. MMLU is effectively saturated. HumanEval is close to solved at the frontier. The benchmarks that still matter are the ones that force spread: HLE for hard knowledge, MMLU-Pro and GPQA for nontrivial factual reasoning, SWE-bench Pro and LiveCodeBench for coding, Terminal-Bench 2.0 and OSWorld-Verified for agents, and MMMU-Pro plus OfficeQA Pro for multimodal work. That shift matters more than any single model launch.
All data below reflects the current BenchLM dataset, last updated March 18, 2026, covering 135 models.
BenchLM's overall score weights eight categories: agentic (22%), coding (20%), reasoning (17%), multimodal grounded (12%), knowledge (12%), multilingual (7%), instruction following (5%), and math (5%). That weighting matters. A model with one elite skill and missing category coverage can lose to a model that is merely excellent everywhere.
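As a rough illustration of how that weighting plays out, here is a minimal Python sketch. The category weights come from the paragraph above; the category scores are invented placeholders, and details such as rounding and the treatment of missing rows are assumptions rather than BenchLM's published implementation.

```python
# Minimal sketch of a weighted overall score, assuming a simple weighted
# average over the eight BenchLM categories. Scores below are invented.
WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.17,
    "multimodal": 0.12,
    "knowledge": 0.12,
    "multilingual": 0.07,
    "instruction_following": 0.05,
    "math": 0.05,
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average over the eight categories (assumes full coverage)."""
    return sum(WEIGHTS[cat] * category_scores[cat] for cat in WEIGHTS)

# A hypothetical model that is merely "excellent everywhere":
balanced = {cat: 80.0 for cat in WEIGHTS}
print(round(overall_score(balanced), 1))  # 80.0
```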
| Rank | Model | Creator | Overall | Strongest category | Notes |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 83 | Multimodal / Math | Broadest elite all-rounder with full cross-category coverage |
| 2 | GPT-5.4 | OpenAI | 80 | Instruction following / Reasoning | Broad coverage, strongest all-purpose OpenAI general model |
| 3 | Claude Opus 4.6 | Anthropic | 76 | Math / Multilingual | Excellent depth, especially HLE and multilingual |
| 4 | Claude Sonnet 4.6 | Anthropic | 76 | Math / Multimodal | Broader coverage than Opus, weaker coding profile |
| 5 | GPT-5.4 Pro | OpenAI | 76 | Math / Knowledge / Reasoning | Best specialist profile, but narrower published coverage |
| 6 | Gemini 3 Pro Deep Think | Google | 70 | Math / Agentic | Strong on BrowseComp, but not a complete all-category row |
| 7 | GPT-5.3 Codex | OpenAI | 70 | Math / Reasoning / Coding | One of the strongest coding-focused models on BenchLM |
| 8 | o3-mini | OpenAI | 70 | Instruction following / Reasoning | Strong reasoning package with weaker coding depth |
| 9 | Qwen2.5-1M | Alibaba | 67 | Math / Reasoning | Best open-weight overall model on BenchLM right now |
| 10 | Grok 4 | xAI | 67 | Math / Reasoning | Strong coding and multimodal row, weaker agentic profile |
| 11 | GPT-5.2 | OpenAI | 67 | Math / Knowledge | Strong older OpenAI row, weaker agentic coverage |
| 12 | Kimi K2.5 (Reasoning) | Moonshot AI | 67 | Multilingual / Math | Excellent coding and multilingual signals, sparser breadth |
| 13 | GPT-4.1 | OpenAI | 67 | Instruction following | Broadly competent, no longer frontier-leading in coding |
| 14 | o1 | OpenAI | 67 | Instruction following | Reasoning-first model with weak coding relative to 2026 leaders |
| 15 | Grok 4.1 | xAI | 67 | Math / Knowledge / Reasoning | Very strong partial row, but far narrower benchmark footprint |
| 16 | DeepSeek V3.2 (Thinking) | DeepSeek | 66 | Math / Multilingual | Best open-weight reasoning-agentic compromise from DeepSeek |
| 17 | DeepSeek Coder 2.0 | DeepSeek | 66 | Instruction following / Math | Strong open coding option, still well behind top closed models overall |
| 18 | o3 | OpenAI | 63 | Math | Competitive reasoning line, weak coding row |
| 19 | Nemotron 3 Ultra 500B | NVIDIA | 63 | Instruction following | Open-weight scale play with weak coding depth |
| 20 | Qwen3.5 397B (Reasoning) | Alibaba | 63 | Math / Reasoning | Strong narrow reasoning profile, less cross-category evidence |
Gemini 3.1 Pro does not win because it has the single highest ceiling anywhere. It wins because it almost never drops.
Its category profile is unusually balanced: #1 in the weighted multimodal category at 95.0, top five in both agentic (76.1) and reasoning (88.3), and solid in math, multilingual, and coding rather than collapsing anywhere.
That is the kind of row that survives any reasonable weighting system. It also has benchmark coverage across all eight categories, which matters when overall scores are meant to represent real deployment breadth rather than a lab highlight reel.
The strongest part of the Gemini case is multimodal plus long-context utility. It leads the weighted multimodal category at 95, ties the top MMMU-Pro score at 95, and remains competitive on OfficeQA Pro at 95. It also holds up unusually well in multilingual and math. If you want a single general-purpose model with few obvious holes, Gemini 3.1 Pro has the cleanest argument on current data.
GPT-5.4 ranks #2 overall at 80 and is probably the strongest "safe default" alternative to Gemini 3.1 Pro.
Its broad row is difficult to dismiss: top three in the weighted agentic (77.0) and multimodal (87.9) categories, top five in reasoning (89.9), and third on HLE at 48.
It also leads some of the benchmarks that matter most in actual product workflows. GPT-5.4 is #1 on OfficeQA Pro at 96 and #1 on OSWorld-Verified at 75. That matters because those are not toy tasks. They reward models that can operate over messy interfaces, documents, screenshots, and workflow-like inputs.
The only obvious hole is math coverage. GPT-5.4 has no current math row in BenchLM's dataset, which keeps it from challenging GPT-5.4 Pro on pure specialist strength. But for teams that care more about breadth than benchmark perfection in a narrow category, GPT-5.4 remains a very strong all-around production option.
GPT-5.4 Pro is the strongest model on several individual category tables: it leads the weighted coding view at 86.0, reasoning at 95.0, and knowledge at 96.3.
If you only looked at these rows, you would assume GPT-5.4 Pro is the clear overall winner. It is not. The reason is simple: the current published row is narrower. GPT-5.4 Pro has no current agentic, multimodal, or multilingual category eligibility in BenchLM's data. BenchLM's overall leaderboard rewards demonstrated breadth. That is the correct choice for a production ranking, even if it is less flattering to narrow specialist leaders.
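To make the coverage argument concrete, here is a toy comparison of two aggregation policies. Neither function is BenchLM's published formula, and both model rows are invented numbers loosely shaped like the tables in this report; the point is only that a renormalized average flatters a narrow specialist, while any policy that charges for missing categories favors demonstrated breadth.

```python
# Toy aggregation policies, illustrating why coverage handling matters.
# These are NOT BenchLM's published formulas; the rows below are invented.
WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "multimodal": 0.12,
    "knowledge": 0.12, "multilingual": 0.07, "instruction_following": 0.05,
    "math": 0.05,
}

def renormalized(scores: dict[str, float]) -> float:
    """Average only over published categories, reweighted to sum to 1."""
    total_weight = sum(WEIGHTS[c] for c in scores)
    return sum(WEIGHTS[c] * s for c, s in scores.items()) / total_weight

def zero_filled(scores: dict[str, float]) -> float:
    """Treat missing categories as zero, so narrow coverage costs points."""
    return sum(WEIGHTS[c] * scores.get(c, 0.0) for c in WEIGHTS)

specialist = {  # elite but narrow: no agentic, multimodal, or multilingual rows
    "coding": 86, "reasoning": 95, "knowledge": 96,
    "instruction_following": 94, "math": 97,
}
all_rounder = {c: 82 for c in WEIGHTS}  # broad coverage, no single elite peak

print(round(renormalized(specialist), 1), round(renormalized(all_rounder), 1))  # 92.2 82.0
print(round(zero_filled(specialist), 1), round(zero_filled(all_rounder), 1))    # 54.4 82.0
```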
GPT-5.3 Codex is the same story in a more coding-centric form. It ranks #7 overall, but it remains one of the strongest coding models in the dataset: #2 in the weighted coding category at 85.0, just behind GPT-5.4 Pro.
For teams choosing an API for coding agents or software tooling, GPT-5.3 Codex is arguably more interesting than its overall rank suggests.
Anthropic's frontier is split between two models: Claude Opus 4.6, tied at 76 overall with the strongest HLE and multilingual depth, and Claude Sonnet 4.6, also at 76 with broader category coverage but a weaker coding profile.
The important takeaway is that the frontier is no longer winner-take-all. Gemini leads the most balanced overall package. GPT has the strongest specialist lines in several categories. Anthropic still owns some of the hardest knowledge and multilingual rows. Model selection is now more about use case fit than abstract prestige.
The overall leaderboard is useful, but it hides where the real separation lives. These are the category tables that matter most.
BenchLM weights coding toward the benchmarks that still separate frontier models, with SWE-bench Pro and LiveCodeBench carrying more influence than saturated legacy tests.
That weighting is important because it reflects what changed in 2026. HumanEval still exists, but it is no longer the main signal for frontier coding models.
Top coding category scores:
| Rank | Model | Coding score |
|---|---|---|
| 1 | GPT-5.4 Pro | 86.0 |
| 2 | GPT-5.3 Codex | 85.0 |
| 3 | Kimi K2.5 (Reasoning) | 82.9 |
| 4 | Kimi K2.5 | 82.9 |
| 5 | Claude Opus 4.5 | 80.9 |
The benchmark-level picture is more fragmented: GPT-5.4 Pro leads SWE-bench Verified, Gemini 3.1 Pro leads SWE-bench Pro, and Kimi K2.5 leads LiveCodeBench.
That is a good example of why single-benchmark arguments are weak. If you care about repository bug fixing, GPT-5.4 Pro and GPT-5.3 Codex look best. If you care about fresh coding tasks, Kimi matters more. If you want the broadest balanced general-purpose model that still holds up on coding, Gemini 3.1 Pro remains competitive.
Agentic is now the heaviest category in BenchLM's formula at 22%, ahead of coding. That is a reasonable reflection of where the market is going.
Top agentic category scores:
| Rank | Model | Agentic score |
|---|---|---|
| 1 | MiMo-V2-Pro | 86.7 |
| 2 | Gemini 3 Pro Deep Think | 78.8 |
| 3 | GPT-5.4 | 77.0 |
| 4 | Gemini 3.1 Pro | 76.1 |
| 5 | Claude Opus 4.6 | 72.6 |
Two caveats matter here. MiMo-V2-Pro's #1 position rests on an outstanding but narrow published row rather than broad category coverage, and Gemini 3 Pro Deep Think likewise lacks a complete all-category row despite its BrowseComp strength.
The bigger story is that agentic benchmarks still have real spread. Terminal-Bench 2.0 has a 15.7-point gap between first and fifth. BrowseComp has a 10-point gap. These are not saturated tests.
Reasoning and knowledge are where the old and new benchmark worlds collide.
Top reasoning category scores:
| Rank | Model | Reasoning score |
|---|---|---|
| 1 | GPT-5.4 Pro | 95.0 |
| 2 | GPT-5.3 Codex | 93.0 |
| 3 | Grok 4.1 | 93.0 |
| 4 | GPT-5.4 | 89.9 |
| 5 | Gemini 3.1 Pro | 88.3 |
Top knowledge category scores:
| Rank | Model | Knowledge score |
|---|---|---|
| 1 | GPT-5.4 Pro | 96.3 |
| 2 | Grok 4.1 | 95.6 |
| 3 | Claude Opus 4.5 | 95.0 |
| 4 | GPT-5.2-Codex | 95.0 |
| 5 | Gemini 3 Pro | 95.0 |
At first glance those knowledge scores look tightly clustered, and that is exactly the point. Some older knowledge benchmarks are no longer doing useful work. On MMLU, the top five models all score 99. That benchmark is now a floor check, not a frontier separator.
The better 2026 signals are MMLU-Pro, GPQA, and HLE.
HLE is the clearest knowledge separator right now. Claude Opus 4.6 leads it at 53, ahead of Claude Sonnet 4.6 at 49 and GPT-5.4 at 48. If you want one benchmark that still tells you something meaningful about the hard edge of frontier knowledge, HLE has the strongest argument.
This is one of the biggest strategic categories in 2026 because enterprise and agent workflows increasingly involve screenshots, spreadsheets, documents, dashboards, and mixed inputs rather than plain text.
Top multimodal category scores:
| Rank | Model | Multimodal score |
|---|---|---|
| 1 | Gemini 3.1 Pro | 95.0 |
| 2 | Claude Sonnet 4.6 | 91.9 |
| 3 | GPT-5.4 | 87.9 |
| 4 | Gemini 3 Pro | 81.0 |
| 5 | Claude 4 Sonnet | 79.7 |
The benchmark split is useful: Gemini 3.1 Pro ties the top MMMU-Pro score at 95, while GPT-5.4 is #1 on OfficeQA Pro at 96 with Gemini close behind at 95.
That means the multimodal story is not just "Gemini wins." Gemini wins the weighted category because it is elite on both tests. But GPT-5.4 deserves explicit credit for being best on the more office-workflow-shaped benchmark.
Open-weight models are now good enough that the conversation changed. They are no longer interesting only as cheap substitutes. Some are credible choices.
The strongest open-weight overall rows on BenchLM right now are:
| Rank | Model | Overall | Notes |
|---|---|---|---|
| 1 | Qwen2.5-1M | 67 | Best open-weight overall model on BenchLM |
| 2 | DeepSeek V3.2 (Thinking) | 66 | Strong reasoning-agentic compromise |
| 3 | DeepSeek Coder 2.0 | 66 | Best open coding profile in the top open-weight tier |
| 4 | Nemotron 3 Ultra 500B | 63 | Strong scale and instruction-following profile |
| 5 | Qwen3.5 397B (Reasoning) | 63 | Narrow reasoning strength, thinner breadth |
This is real progress. Qwen2.5-1M ranking #9 overall means the open-weight tier is no longer shut out of the frontier conversation.
But the gap is still visible in three places: coding depth, multimodal quality, and broad all-category consistency.
The practical version is simple: if you need the absolute best all-purpose model, you still end up in the proprietary tier. If you need a strong open-weight system, the open tier is now good enough that the tradeoff is technical, not symbolic.
The clearest shift in 2026 is not just which model is first. It is which benchmarks are still worth trusting as primary frontier signals.
| Benchmark | Top score | 5th place | 10th place | What it tells you |
|---|---|---|---|---|
| MMLU | 99 | 99 | 90.8 | Effectively saturated at the top |
| MMLU-Pro | 94 | 90 | 84 | Still useful for broad knowledge separation |
| HLE | 53 | 40 | 18.8 | Strongest hard-knowledge spread |
| SWE-bench Pro | 72 | 58 | 55 | Real coding separation remains |
| LiveCodeBench | 85 | 80 | 63.6 | Fresh coding tasks still create a long tail |
| Terminal-Bench 2.0 | 86.7 | 71 | 63 | One of the clearest agentic separators |
| BrowseComp | 87 | 77 | 72 | Strong research-agent benchmark |
MMLU has effectively stopped differentiating elite knowledge models. Top five scores: 99, 99, 99, 99, 99.
HumanEval still provides some signal, but far less than it used to. Top five scores: 99, 95, 95, 95, 93. That is better than MMLU, but still weak compared with harder coding tests.
HLE remains one of the best high-end knowledge filters because the spread is still large: top score 53, fifth score 40, tenth score 18.8.
SWE-bench Pro remains more useful than legacy coding tests because the spread is still large: top score 72, fifth 58, tenth 55.
LiveCodeBench is valuable because it uses fresher tasks and still shows a meaningful long tail: top score 85, tenth 63.6.
Terminal-Bench 2.0 is one of the clearest non-saturated agentic benchmarks in the set: top score 86.7, fifth 71, tenth 63.
BrowseComp is a strong research-agent benchmark with a meaningful spread: top 87, fifth 77.
OSWorld-Verified is slightly tighter at the top, but still useful because it measures real interface work rather than stylized QA.
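Those spread observations can be expressed as a simple check over the numbers in the benchmark-trends table above. The figures are copied from that table; the one-point "saturating" threshold is an arbitrary illustrative cutoff, not a BenchLM rule.

```python
# Flag benchmarks whose top-5 scores have collapsed together, using the
# (top, 5th, 10th) values from the table above. The 1-point threshold is
# an illustrative assumption, not BenchLM's methodology.
BENCHMARKS = {
    #  name                 (top,  5th,  10th)
    "MMLU":                (99.0, 99.0, 90.8),
    "MMLU-Pro":            (94.0, 90.0, 84.0),
    "HLE":                 (53.0, 40.0, 18.8),
    "SWE-bench Pro":       (72.0, 58.0, 55.0),
    "LiveCodeBench":       (85.0, 80.0, 63.6),
    "Terminal-Bench 2.0":  (86.7, 71.0, 63.0),
    "BrowseComp":          (87.0, 77.0, 72.0),
}

for name, (top, fifth, tenth) in BENCHMARKS.items():
    status = "saturating at the top" if top - fifth < 1.0 else "still separating"
    print(f"{name:<20} top-5th gap {top - fifth:5.1f}  top-10th gap {top - tenth:5.1f}  {status}")
```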
The broader point is that benchmark selection has become a methodological problem. If your ranking still leans heavily on MMLU and HumanEval, it will systematically overstate certainty and understate real product differences.
The frontier no longer has one obvious value story. It has tiers.
These are the models that make the most sense for teams that need broad capability without paying flagship premiums.
Flagship specialist pricing is hard to justify unless you specifically need the incremental specialist gains. GPT-5.4 Pro does earn its place on reasoning, knowledge, instruction following, and math. Claude Opus 4.6 does earn its place on HLE and multilingual. But most teams are not buying a benchmark trophy. They are buying a model that has to survive production economics.
GPT-5.3 Codex remains one of the cleanest price-performance stories in the whole market. It is a top-tier coding model by BenchLM's weighted coding view while staying far cheaper than the flagship specialist tier.
The biggest change is not that one lab pulled ahead forever. It is that the benchmark stack got harsher.
| Area | 2025 emphasis | 2026 emphasis | Why it matters |
|---|---|---|---|
| Knowledge ranking | MMLU-heavy discussions | MMLU-Pro, GPQA, HLE | Harder tests still create spread |
| Coding quality | HumanEval and legacy SWE-bench citations | SWE-bench Pro and LiveCodeBench | Better match for real software work |
| Agent capability | Demo-driven claims | Terminal-Bench 2.0, BrowseComp, OSWorld-Verified | Action-based evaluation is harder to fake |
| Multimodal strength | Vision as a side feature | MMMU-Pro and OfficeQA Pro | Grounded document and UI tasks now matter |
| Leaderboard logic | Best single score wins headlines | Breadth and coverage matter more | Sparse rows should not decide the full table |
Compared with the older evaluation mix that centered attention on MMLU, HumanEval, and broad chat impressions, 2026 is much more defined by:

- Harder knowledge tests that still create spread (MMLU-Pro, GPQA, HLE)
- Coding benchmarks closer to real software work (SWE-bench Pro, LiveCodeBench)
- Action-based agent evaluation (Terminal-Bench 2.0, BrowseComp, OSWorld-Verified)
- Grounded document and UI tasks (MMMU-Pro, OfficeQA Pro)
- Leaderboard logic that rewards breadth and coverage over a single headline score
That shift makes the leaderboard harder to game and harder to summarize lazily. It also explains why some models look better in current serious ranking systems than they did in older, more benchmark-saturated discussions.
The second change is that open-weight systems are now close enough to matter strategically, even when they are not yet winning overall. The top open models have entered the main table. They are no longer living in a separate hobbyist category.
The third change is that coverage matters more than ever. In a world where model labs selectively publish benchmark rows, the best ranking system is not the one that rewards the loudest claim. It is the one that discounts incomplete evidence.
This report uses BenchLM's current public ranking system and benchmark dataset.
Some models have outstanding narrow rows with limited total coverage. GPT-5.4 Pro is the best example among current frontier proprietary models. MiMo-V2-Pro is the clearest example in agentic. These are real signals, but they are not enough by themselves to justify an undisputed overall #1 position. BenchLM's leaderboard is intentionally conservative here.
That is the right editorial stance for a source-of-record product. If the data is partial, the claim should also be partial.
If you want the most balanced general model in BenchLM's current data, pick Gemini 3.1 Pro.
If you want the strongest all-purpose OpenAI model with broad product usefulness, pick GPT-5.4.
If you want the strongest narrow specialist row on reasoning, knowledge, and math, look at GPT-5.4 Pro, but read the coverage caveat.
If you want the strongest coding-focused value play, look at GPT-5.3 Codex.
If you want the best open-weight all-rounder, start with Qwen2.5-1M.
The larger conclusion is that "best model" is no longer a serious question without a benchmark context. In 2026, the meaningful question is: best model for which task, under which cost constraints, with how much published evidence?
→ See the full leaderboard · Coding rankings · Reasoning rankings · Knowledge rankings · Agentic rankings · Pricing · What benchmarks actually measure · Are AI benchmarks reliable?
What is the best AI model in 2026?
On BenchLM's current data, Gemini 3.1 Pro is #1 overall with a score of 83. It wins because it is strong across every major category rather than dominating only one benchmark family.
Which LLM is best for coding in 2026?
The weighted coding category is led by GPT-5.4 Pro at 86, followed by GPT-5.3 Codex at 85. But the benchmark-level answer depends on what you care about: GPT-5.4 Pro leads SWE-bench Verified, Gemini 3.1 Pro leads SWE-bench Pro, and Kimi K2.5 leads LiveCodeBench.
Are open-weight models close to proprietary models in 2026?
Closer, yes. Equal, no. Qwen2.5-1M ranks #9 overall and DeepSeek V3.2 (Thinking) ranks #16, which means the open-weight tier is now competitive enough to matter in serious model selection. The remaining gap is largest in coding depth, multimodal quality, and broad all-category consistency.
Which benchmarks matter most in 2026?
The strongest current separators are HLE, MMLU-Pro, GPQA, SWE-bench Pro, LiveCodeBench, Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, MMMU-Pro, and OfficeQA Pro. MMLU and HumanEval still provide context, but they are weaker primary ranking signals now.
Are older benchmarks like MMLU still useful?
Only as a baseline. MMLU no longer separates top frontier models in a meaningful way because the best systems all cluster at 99.
How does BenchLM rank models?
BenchLM uses weighted category scores rather than a flat average. Agentic and coding carry the most weight, and within each category the benchmarks that still produce real separation carry more influence than saturated legacy tests.
All data sourced from BenchLM.ai. Dataset last updated March 18, 2026.