Reasoning Benchmarks — Long Context, MRCR & Multi-Step Inference Leaderboard
Logical reasoning and problem solving
Bottom line: Reasoning models (chain-of-thought) dominate this category, but standard models are closing the gap on shorter-context tasks.
MuSR · BBH · LisanBench · LongBench v2 · MRCRv2 · MRCR v2 64K-128K · MRCR v2 128K-256K · Graphwalks BFS 128K · Graphwalks Parents 128K · ARC-AGI-2
Best Reasoning picks
BenchLM's reasoning summaries, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.
Top AI Models for Reasoning — April 2026
As of April 2026, Gemini 3.1 Pro leads the provisional reasoning leaderboard with a weighted score of 97.0%, followed by GPT-5.3 Codex (94.6%) and GPT-5.4 (93.0%). BenchLM is currently showing 93 provisional-ranked models and 0 verified-ranked models in this category.
Gemini 3.1 Pro
Google
GPT-5.3 Codex
OpenAI
GPT-5.4
OpenAI
What changed
Gemini 3.1 Pro takes the #1 reasoning spot with a 97.0% weighted score.
GPT-5.3 Codex holds a strong #2 at 94.6%.
GPT-5.4 rounds out the top three at 93.0%.
How to choose
Top models by benchmark
Complex multi-step reasoning problems (20% of category score)
Reasoning Leaderboard
Updated April 16, 2026. Sorted by reasoning weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only rankings. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | Weighted | Overall | MuSR | BBH | LisanBench | LongBench v2 | MRCRv2 | MRCR v2 64K-128K | MRCR v2 128K-256K | Graphwalks BFS 128K | Graphwalks Parents 128K | ARC-AGI-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 97% | 94 | — | — | — | — | — | — | — | — | — | 77.1% |
| 2 | GPT-5.3 Codex | OpenAI | 94.6% | Est. 89 | — | — | — | — | — | — | — | — | — | — |
| 3 | GPT-5.4 | OpenAI | 93% | 93 | — | — | — | — | — | — | — | — | — | — |
| 4 | Grok 4.1 | xAI | 91.7% | Est. 80 | — | — | — | — | — | — | — | — | — | — |
| 5 | Claude Opus 4.6 | Anthropic | 90% | 92 | — | — | — | — | — | — | — | — | — | — |
| 6 | GPT-5.1-Codex-Max | OpenAI | 89.8% | Est. 79 | — | — | — | — | — | — | — | — | — | — |
| 7 | Gemini 3 Pro Deep Think | Google | 89% | Est. 87 | — | — | — | — | — | — | — | — | — | 45.1% |
| 8 | | | 88.9% | Est. 72 | — | — | — | — | — | — | — | — | — | — |
| 9 | GPT-5.2-Codex | OpenAI | 88.5% | Est. 80 | — | — | — | — | — | — | — | — | — | — |
| 10 | | | 88.2% | Est. 84 | — | — | — | — | — | — | — | — | — | — |
| 11 | GPT-5.2 | OpenAI | 85.3% | Est. 83 | — | — | — | — | — | — | — | — | — | 52.9% |
| 12 | Claude Sonnet 4.6 | Anthropic | 82.5% | 86 | — | — | — | — | — | — | — | — | — | — |
| 13 | Gemini 3 Pro | Google | 82.4% | Est. 83 | — | — | — | — | — | — | — | — | — | 31.1% |
| 14 | Qwen3.5 397B (Reasoning) | Alibaba | 82.2% | Est. 81 | — | — | — | — | — | — | — | — | — | — |
| 15 | GPT-4.1 mini | OpenAI | 79.4% | Est. 47 | — | — | — | — | — | — | — | — | — | — |
| 16 | GPT-4.1 | OpenAI | 77.5% | Est. 60 | — | — | — | — | — | — | — | — | — | — |
| 17 | GPT-5 (high) | OpenAI | 77.1% | Est. 80 | — | — | — | — | — | — | — | — | — | — |
| 18 | o1-preview | OpenAI | 76.1% | Est. 68 | — | — | — | — | — | — | — | — | — | — |
| 19 | GPT-5 (medium) | OpenAI | 74.9% | Est. 74 | — | — | — | — | — | — | — | — | — | — |
| 20 | o1 | OpenAI | 74.2% | Est. 59 | — | — | — | — | — | — | — | — | — | — |
| 21 | GLM-4.7 | Z.AI | 73.1% | Est. 72 | — | — | — | — | — | — | — | — | — | — |
| 22 | o3-pro | OpenAI | 71% | Est. 59 | — | — | — | — | — | — | — | — | — | — |
| 23 | Qwen2.5-1M | Alibaba | 71% | Est. 53 | — | — | — | — | — | — | — | — | — | — |
| 24 | Claude Opus 4.5 | Anthropic | 70.1% | 80 | — | — | — | 64.4% | — | — | — | — | — | — |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | 69.8% | Est. 79 | — | — | — | — | — | — | — | — | — | — |
These rankings update weekly
Get notified when models move. One email a week with what changed and why.
Free. No spam. Unsubscribe anytime.
Score in Context
What these scores mean
Reasoning carries a 17% weight in overall scoring. The weighted score blends long-context comprehension (LongBench v2, MRCRv2), multi-step inference (MuSR), and novel problem solving (ARC-AGI-2). A 5-point gap here usually means the difference between a model that tracks complex argument chains reliably and one that loses the thread.
Known limitations
Models with explicit chain-of-thought (reasoning models) tend to outperform standard models by large margins, but at the cost of higher latency and token usage. ARC-AGI-2 is still early — coverage is uneven, and some models lack scores. MuSR is underrepresented because few providers run it.
How we weight
Reasoning carries a 17% weight in BenchLM.ai's overall scoring, reflecting the growing importance of long-context reasoning in production systems.
Models with explicit chain-of-thought tend to lead on MuSR and long-context tasks, though at the cost of higher latency and token usage. See the reasoning leaderboard or try the LLM selector quiz.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MuSR | 20% | Weighted | Complex multi-step reasoning problems |
| BBH | — | Display only | 23 challenging tasks from BIG-Bench where language models previously underperformed humans |
| LisanBench | — | Display only | Word-chain reasoning benchmark for planning, recall, and constraint following. |
| LongBench v2 | 30% | Weighted | Long-context reasoning and retrieval benchmark |
| MRCRv2 | 25% | Weighted | Multi-round coreference and retrieval benchmark for long-context models |
| MRCR v2 64K-128K | — | Display only | Long-context retrieval benchmark slice focused on 64K-128K context lengths |
| MRCR v2 128K-256K | — | Display only | Long-context retrieval benchmark slice focused on 128K-256K context lengths |
| Graphwalks BFS 128K | — | Display only | Long-context graph traversal benchmark using breadth-first search tasks |
| Graphwalks Parents 128K | — | Display only | Long-context graph reasoning benchmark for parent-retrieval accuracy |
| ARC-AGI-2 | 25% | Weighted | Abstract reasoning |
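The weighting scheme above can be sketched in code. This is a hypothetical illustration, not BenchLM's actual implementation: it assumes that when a weighted benchmark lacks a trustworthy score, the remaining weights are renormalized over the benchmarks that are present, rather than filling the gap with synthetic values.

```python
# Hypothetical sketch of a category's weighted score, assuming weights
# are renormalized over whichever weighted benchmarks have real scores.

# Weighted reasoning benchmarks and their category weights (from the table above).
WEIGHTS = {
    "LongBench v2": 0.30,
    "MRCRv2": 0.25,
    "ARC-AGI-2": 0.25,
    "MuSR": 0.20,
}

def reasoning_score(scores: dict[str, float]) -> float:
    """Blend available benchmark scores, renormalizing over present ones."""
    present = {b: w for b, w in WEIGHTS.items() if b in scores}
    if not present:
        raise ValueError("no weighted benchmark scores available")
    total_weight = sum(present.values())
    return sum(scores[b] * w for b, w in present.items()) / total_weight

# All four benchmarks present: a plain weighted average.
full = reasoning_score(
    {"LongBench v2": 60.0, "MRCRv2": 55.0, "ARC-AGI-2": 40.0, "MuSR": 80.0}
)  # 0.30*60 + 0.25*55 + 0.25*40 + 0.20*80 = 57.75

# ARC-AGI-2 missing: the remaining 75% of weight is scaled back up to 100%.
partial = reasoning_score({"LongBench v2": 60.0, "MRCRv2": 55.0, "MuSR": 80.0})
```

Under this sketch, a model missing ARC-AGI-2 is scored only on the benchmarks it actually ran, so sparse coverage changes which benchmarks drive its rank rather than dragging the score toward zero.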
About Reasoning Benchmarks
This category measures complex multi-step reasoning, long-context comprehension, and novel problem solving.