Reasoning Benchmarks

Logical reasoning and problem solving

SimpleQA · MuSR · BBH · LongBench v2 · MRCRv2

Reasoning benchmarks evaluate whether AI models can think logically, chain multiple steps of inference, and stay factually accurate under pressure. This category carries a 17% weight in BenchLM.ai's overall scoring, reflecting the growing importance of long-context reasoning in production systems.
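
To make the 17% category weight concrete, here is a minimal sketch of the arithmetic, assuming the five benchmark results inside the category are averaged with equal weights. The exact aggregation BenchLM.ai uses is not published on this page (the displayed category scores suggest a different formula), so the helper names and per-benchmark weighting below are assumptions for illustration only.

```python
# Illustrative sketch only: the 17% category weight comes from the text above;
# the equal per-benchmark weighting and helper names are assumptions, and the
# leaderboard's own Score column evidently uses a different aggregation.

REASONING_WEIGHT = 0.17  # Reasoning's stated share of the overall BenchLM.ai score

def reasoning_category_score(results: dict[str, float]) -> float:
    """Average the five reasoning-benchmark percentages (assumed equal weights)."""
    return sum(results.values()) / len(results)

# Example: GPT-5.4 Pro's row from the leaderboard below.
gpt_5_4_pro = {"SimpleQA": 97, "MuSR": 95, "BBH": 98, "LongBench v2": 95, "MRCRv2": 97}

category = reasoning_category_score(gpt_5_4_pro)   # 96.4 under this naive average
contribution = REASONING_WEIGHT * category         # ~16.4 points of the overall 100
print(f"Category score {category:.1f}, overall contribution {contribution:.1f}")
```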

BenchLM.ai tracks five complementary reasoning benchmarks: SimpleQA measures short-form factual accuracy; MuSR tests multi-step soft reasoning across paragraphs of context; BBH (BIG-Bench Hard) is retained as a historical baseline; and LongBench v2 and MRCRv2 measure whether models can actually use their long context windows rather than merely advertising them.

Reasoning performance is where the gap between "reasoning" and "non-reasoning" model architectures is most visible. Models with explicit chain-of-thought capabilities tend to outperform standard models by significant margins on MuSR and BBH, though at the cost of higher latency and token usage. See our reasoning rankings or compare specific models using our LLM selector quiz.

124 models tracked · top 25 shown

| Rank | Model | Provider | Access | Type | Context | Score | SimpleQA | MuSR | BBH | LongBench v2 | MRCRv2 |
|------|-------|----------|--------|------|---------|-------|----------|------|-----|--------------|--------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 97% | 95% | 98% | 95% | 97% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 97% | 95% | 98% | 93% | 95% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 97% | 94% | 97% | 95% | 97% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 93% | 98% | 92% | 93% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 95% | 93% | 96% | 91% | 93% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 96% | 94% | 97% | 92% | 94% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 94% | 92% | 97% | 91% | 92% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 95% | 93% | 94% | 92% | 92% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 95% | 93% | 96% | 89% | 84% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 93% | 90% | 90% | 91% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 95% | 93% | 92% | 93% | 90% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 92% | 92% | 90% | 93% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 95% | 93% | 93% | 90% | 89% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 95% | 93% | 95% | 94% | 96% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 93% | 91% | 92% | 84% | 84% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 89% | 87% | 94% | 83% | 80% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 95% | 93% | 88% | 83% | 79% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 92% | 90% | 91% | 86% | 87% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 87% | 85% | 92% | 81% | 81% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 95% | 93% | 87% | 82% | 81% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 95% | 93% | 90% | 90% | 87% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 88% | 86% | 93% | 87% | 83% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 91% | 89% | 88% | 82% | 81% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 90% | 88% | 87% | 87% | 89% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 88% | 86% | 91% | 82% | 81% |

Showing 25 of 124
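
If you want to slice the leaderboard yourself, here is a minimal sketch of one way to represent the rows and query them. The field names and record layout are assumptions (not BenchLM.ai's actual data model); the handful of values are copied from the table above, and the example query simply mirrors the reasoning-vs-standard comparison discussed earlier.

```python
# Minimal sketch of representing and querying leaderboard rows.
# Field names are assumptions; values are taken from the table above.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Row:
    model: str
    provider: str
    access: str        # "Open" / "Closed"
    type: str          # "Reasoning" / "Standard"
    context: str
    score: int
    simpleqa: int
    musr: int
    bbh: int
    longbench_v2: int
    mrcrv2: int

rows = [
    Row("GPT-5.4 Pro", "OpenAI", "Closed", "Reasoning", "1.05M", 91, 97, 95, 98, 95, 97),
    Row("Claude Opus 4.6", "Anthropic", "Closed", "Standard", "1M", 85, 95, 93, 94, 92, 92),
    Row("Gemini 3 Pro Deep Think", "Google", "Closed", "Reasoning", "2M", 81, 95, 93, 95, 94, 96),
    Row("Grok 4.1", "xAI", "Closed", "Standard", "1M", 84, 95, 93, 93, 90, 89),
]

# Example query: average BBH score for reasoning vs. standard architectures,
# echoing the chain-of-thought gap discussed above.
for kind in ("Reasoning", "Standard"):
    subset = [r.bbh for r in rows if r.type == kind]
    print(kind, round(mean(subset), 1))
```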

About Reasoning Benchmarks

SimpleQA

Factual question answering benchmark