What is the best LLM for reasoning?

The top reasoning LLMs are ranked using benchmarks like MuSR, SimpleQA, LongBench v2, and MRCRv2, which test logical deduction, multi-step reasoning, factual accuracy, and long-context discipline.

How do reasoning benchmarks evaluate LLMs?

Reasoning benchmarks evaluate LLMs by presenting tasks that require multi-step logical deduction, causal inference, and complex problem solving beyond simple pattern matching.

What is the difference between reasoning and knowledge benchmarks?

Reasoning benchmarks test logical thinking and multi-step inference, while knowledge benchmarks focus on factual recall. A model can have strong reasoning but limited factual knowledge, or vice versa.

Reasoning Benchmarks

Name: Reasoning Benchmarks — LLM Leaderboard
Creator: BenchLM.ai

Logical reasoning and problem solving

SimpleQA · MuSR · BBH · LongBench v2 · MRCRv2

Reasoning benchmarks evaluate whether AI models can think logically, chain multiple steps of inference, and handle factual accuracy under pressure. This category carries a 17% weight in BenchLM.ai's overall scoring, reflecting the growing importance of long-context reasoning in production systems.
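
To make the 17% figure concrete, here is a minimal sketch of how a category weight translates into points on an overall score. It assumes the overall score is a weighted sum of category scores on a 0 to 100 scale, which the page does not confirm; the function name and the example category scores of 91 and 76 are hypothetical.

```python
# Assumption: overall score = sum(weight_i * category_score_i), each on a 0-100 scale.
# Only the 17% reasoning weight comes from the text; everything else is illustrative.
REASONING_WEIGHT = 0.17


def reasoning_contribution(category_score: float, weight: float = REASONING_WEIGHT) -> float:
    """Return the points a reasoning category score would add to the overall score."""
    if not 0.0 <= category_score <= 100.0:
        raise ValueError("category_score must be between 0 and 100")
    return weight * category_score


# Two hypothetical reasoning category scores, 15 points apart.
high = reasoning_contribution(91)
low = reasoning_contribution(76)
print(f"high: {high:.2f} points, low: {low:.2f} points, gap: {high - low:.2f} points")
```

Under that assumption, a 15-point difference in the reasoning category moves the overall score by about 2.5 points, so the other categories (the remaining 83% of the weight) still dominate the overall ranking.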

BenchLM.ai tracks five complementary reasoning benchmarks: SimpleQA measures short-form factual accuracy, MuSR tests multi-step soft reasoning across paragraphs of context, BBH (BIG-Bench Hard) remains as a historical baseline, and LongBench v2 plus MRCRv2 measure whether models can actually use long context windows instead of merely advertising them.

Reasoning performance is where the gap between "reasoning" and "non-reasoning" model architectures is most visible. Models with explicit chain-of-thought capabilities tend to outperform standard models by significant margins on MuSR and BBH, though at the cost of higher latency and token usage. See our reasoning rankings or compare specific models using our LLM selector quiz.

124 models


Rank	Model	Creator	Open/Closed	Type	Context	Score	SimpleQA	MuSR	BBH	LongBench v2	MRCRv2
1	GPT-5.4 Pro	OpenAI	Closed	Reasoning	1.05M	91	97%	95%	98%	95%	97%
2	GPT-5.2 Pro	OpenAI	Closed	Reasoning	400K	90	97%	95%	98%	93%	95%
3	GPT-5.4	OpenAI	Closed	Reasoning	1.05M	90	97%	94%	97%	95%	97%
4	GPT-5.3 Codex	OpenAI	Closed	Reasoning	400K	89	95%	93%	98%	92%	93%
5	GPT-5.2	OpenAI	Closed	Reasoning	400K	88	95%	93%	96%	91%	93%
6	GPT-5.3 Instant	OpenAI	Closed	Reasoning	128K	87	96%	94%	97%	92%	94%
7	GPT-5.3-Codex-Spark	OpenAI	Closed	Reasoning	256K	87	94%	92%	97%	91%	92%
8	Claude Opus 4.6	Anthropic	Closed	Standard	1M	85	95%	93%	94%	92%	92%
9	GPT-5.2 Instant	OpenAI	Closed	Reasoning	128K	85	95%	93%	96%	89%	84%
10	GPT-5.2-Codex	OpenAI	Closed	Reasoning	400K	85	95%	93%	90%	90%	91%
11	Gemini 3.1 Pro	Google	Closed	Standard	1M	84	95%	93%	92%	93%	90%
12	GPT-5.1-Codex-Max	OpenAI	Closed	Reasoning	400K	84	94%	92%	92%	90%	93%
13	Grok 4.1	xAI	Closed	Standard	1M	84	95%	93%	93%	90%	89%
14	Gemini 3 Pro Deep Think	Google	Closed	Reasoning	2M	81	95%	93%	95%	94%	96%
15	GPT-5.1	OpenAI	Closed	Reasoning	200K	80	93%	91%	92%	84%	84%
16	GPT-5 (high)	OpenAI	Closed	Reasoning	128K	79	89%	87%	94%	83%	80%
17	Claude Sonnet 4.6	Anthropic	Closed	Standard	200K	78	95%	93%	88%	83%	79%
18	GLM-5 (Reasoning)	Zhipu AI	Open	Reasoning	200K	78	92%	90%	91%	86%	87%
19	GPT-5 (medium)	OpenAI	Closed	Reasoning	128K	78	87%	85%	92%	81%	81%
20	Claude Opus 4.5	Anthropic	Closed	Standard	200K	77	95%	93%	87%	82%	81%
21	Gemini 3 Pro	Google	Closed	Standard	2M	77	95%	93%	90%	90%	87%
22	o1-preview	OpenAI	Closed	Reasoning	200K	77	88%	86%	93%	87%	83%
23	Claude Sonnet 4.5	Anthropic	Closed	Standard	200K	76	91%	89%	88%	82%	81%
24	Grok 4.1 Fast	xAI	Closed	Standard	1M	76	90%	88%	87%	87%	89%
25	Kimi K2.5 (Reasoning)	Moonshot AI	Closed	Reasoning	128K	76	88%	86%	91%	82%	81%

Showing 25 of 124
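
As a rough way to read the 25 rows shown, the sketch below groups them by the Reasoning/Standard label and averages the integer score in each row. The model names, labels, and scores are copied from the rows above; the helper name `scores_by_type` and the pipe-separated layout are hypothetical, and since only 25 of the 124 models are shown, the result says nothing about the full leaderboard.

```python
from collections import defaultdict
from statistics import mean

# model|type|score, copied from the 25 rows shown above.
ROWS = """\
GPT-5.4 Pro|Reasoning|91
GPT-5.2 Pro|Reasoning|90
GPT-5.4|Reasoning|90
GPT-5.3 Codex|Reasoning|89
GPT-5.2|Reasoning|88
GPT-5.3 Instant|Reasoning|87
GPT-5.3-Codex-Spark|Reasoning|87
Claude Opus 4.6|Standard|85
GPT-5.2 Instant|Reasoning|85
GPT-5.2-Codex|Reasoning|85
Gemini 3.1 Pro|Standard|84
GPT-5.1-Codex-Max|Reasoning|84
Grok 4.1|Standard|84
Gemini 3 Pro Deep Think|Reasoning|81
GPT-5.1|Reasoning|80
GPT-5 (high)|Reasoning|79
Claude Sonnet 4.6|Standard|78
GLM-5 (Reasoning)|Reasoning|78
GPT-5 (medium)|Reasoning|78
Claude Opus 4.5|Standard|77
Gemini 3 Pro|Standard|77
o1-preview|Reasoning|77
Claude Sonnet 4.5|Standard|76
Grok 4.1 Fast|Standard|76
Kimi K2.5 (Reasoning)|Reasoning|76
"""


def scores_by_type(raw: str) -> dict[str, list[int]]:
    """Group the integer scores by the Reasoning/Standard label."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in raw.strip().splitlines():
        _model, model_type, score = line.split("|")
        groups[model_type].append(int(score))
    return dict(groups)


groups = scores_by_type(ROWS)
for model_type, scores in sorted(groups.items()):
    print(f"{model_type}: n={len(scores)}, mean score={mean(scores):.1f}")
```

Because the sample is the top of a ranked list, any difference between the two groups is best treated as descriptive of these 25 rows only.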

About Reasoning Benchmarks

Factual question answering benchmark