Reasoning Benchmarks — Long Context, MRCR & Multi-Step Inference Leaderboard

Logical reasoning and problem solving

Bottom line: Reasoning models (chain-of-thought) dominate this category, but standard models are closing the gap on shorter-context tasks.

MuSR · BBH · LisanBench · LongBench v2 · MRCRv2 · MRCR v2 64K-128K · MRCR v2 128K-256K · Graphwalks BFS 128K · Graphwalks Parents 128K · ARC-AGI-2

Abstract reasoning · Long-context reasoning

Best Reasoning picks

BenchLM summaries for reasoning, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Reasoning (April 2026)

As of April 2026, Gemini 3.1 Pro leads the provisional reasoning leaderboard with a weighted score of 97.0%, followed by GPT-5.3 Codex (94.6%) and GPT-5.4 (93.0%). BenchLM is currently showing 93 provisional-ranked models and 0 verified-ranked models in this category.

What changed

Claude Mythos Preview leads reasoning with the strongest MRCRv2 and MuSR scores.

GPT-5.4 is a strong #2, with a notable edge on LongBench v2.

Claude Opus 4.6 holds #3, excelling on long-context reasoning tasks.

Reasoning Leaderboard

Updated April 16, 2026

Sorted by reasoning weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

93 ranked models · Showing 25 of 93

Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.

(Most per-benchmark columns and several model names did not survive extraction; only the recoverable rows are listed below.)

Rank | Model | Provider | Weighted score
1 | Gemini 3.1 Pro | Google | 97%
2 | GPT-5.3 Codex | OpenAI | 94.6%
3 | GPT-5.4 | OpenAI | 93%
11 | GPT-5.2 | OpenAI | 85.3%
16 | GPT-4.1 | OpenAI | 77.5%
20 | o1 | OpenAI | 74.2%
22 | o3-pro | OpenAI | 71%
23 | Qwen2.5-1M | Alibaba | 71%

These rankings update weekly.

Score in Context

What these scores mean

Reasoning carries a 17% weight in overall scoring. The weighted score blends long-context comprehension (LongBench v2, MRCRv2), multi-step inference (MuSR), and novel problem solving (ARC-AGI-2). A 5-point gap here usually means the difference between a model that tracks complex argument chains reliably and one that loses the thread.
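The blend above can be written down directly. A minimal sketch, assuming simple linear weighting over the four published weights; the function name and example scores are illustrative, not BenchLM's implementation:

```python
# Category weights as published: LongBench v2 30%, MRCRv2 25%,
# ARC-AGI-2 25%, MuSR 20% (they sum to 1.0).
WEIGHTS = {
    "LongBench v2": 0.30,  # long-context comprehension
    "MRCRv2": 0.25,        # long-context retrieval / coreference
    "ARC-AGI-2": 0.25,     # novel problem solving
    "MuSR": 0.20,          # multi-step inference
}

def weighted_score(scores: dict[str, float]) -> float:
    """Blend per-benchmark scores (0-100) into one category score."""
    return sum(w * scores[name] for name, w in WEIGHTS.items())

# A model scoring 90 everywhere gets a 90 overall; a 20-point drop on a
# single 25%-weighted benchmark alone costs 5 points overall (90 -> 85).
example = weighted_score(
    {"LongBench v2": 90, "MRCRv2": 90, "ARC-AGI-2": 70, "MuSR": 90}
)
```

Seen this way, a 5-point gap in the weighted score can correspond to a 20-point swing on just one heavily weighted benchmark, which is why small headline differences matter.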

Known limitations

Models with explicit chain-of-thought (reasoning models) tend to outperform standard models by large margins, but at the cost of higher latency and token usage. ARC-AGI-2 is still early — coverage is uneven, and some models lack scores. MuSR is underrepresented because few providers run it.

How we weight

Reasoning carries a 17% weight in BenchLM.ai's overall scoring, reflecting the growing importance of long-context reasoning in production systems. For help picking a model, see the reasoning leaderboard or try the LLM selector quiz.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
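One plausible reading of that fallback, sketched as code: excluded or missing benchmark rows simply drop out, and the remaining weights are renormalized so no synthetic value is ever imputed. The function name and edge-case behavior are assumptions, not BenchLM's actual code:

```python
def fallback_weighted_score(scores: dict[str, float],
                            weights: dict[str, float]):
    """Score a category from whichever weighted benchmarks survived
    filtering, renormalizing the weights over the surviving rows."""
    kept = {name: w for name, w in weights.items() if name in scores}
    total = sum(kept.values())
    if total == 0:
        return None  # no trustworthy rows remain: report no score
    return sum((w / total) * scores[name] for name, w in kept.items())
```

Under this sketch, if ARC-AGI-2 were filtered out for a model, its 25% weight would be redistributed proportionally across LongBench v2, MRCRv2, and MuSR rather than replaced with a synthetic score.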

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
MuSR | 20% | Weighted | Complex multi-step reasoning problems
BBH | - | Display only | 23 challenging tasks from BIG-Bench where language models previously underperformed humans
LisanBench | - | Display only | Word-chain reasoning benchmark for planning, recall, and constraint following
LongBench v2 | 30% | Weighted | Long-context reasoning and retrieval benchmark
MRCRv2 | 25% | Weighted | Multi-round coreference and retrieval benchmark for long-context models
MRCR v2 64K-128K | - | Display only | Long-context retrieval benchmark slice focused on 64K-128K context lengths
MRCR v2 128K-256K | - | Display only | Long-context retrieval benchmark slice focused on 128K-256K context lengths
Graphwalks BFS 128K | - | Display only | Long-context graph traversal benchmark using breadth-first search tasks
Graphwalks Parents 128K | - | Display only | Long-context graph reasoning benchmark for parent-retrieval accuracy
ARC-AGI-2 | 25% | Weighted | Abstract reasoning
