Reasoning Benchmarks
Logical reasoning and problem solving
SimpleQA · MuSR · BBH · LongBench v2 · MRCRv2
Reasoning benchmarks evaluate whether AI models can think logically, chain multiple steps of inference, and maintain factual accuracy under pressure. This category carries a 17% weight in BenchLM.ai's overall scoring, reflecting the growing importance of long-context reasoning in production systems.
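As a rough illustration of how a 17% category weight feeds into a composite score, consider the sketch below. Only the reasoning weight comes from the text; the other category names and weights are invented for the example and are not BenchLM.ai's published formula.

```python
# Hypothetical weighted-category scoring. Only the 0.17 reasoning weight
# is stated in the article; the remaining categories and weights are
# assumptions chosen so the weights sum to 1.0.
CATEGORY_WEIGHTS = {
    "reasoning": 0.17,
    "coding": 0.25,
    "math": 0.20,
    "knowledge": 0.20,
    "safety": 0.18,
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores on a 0-100 scale."""
    total = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
    return round(total, 1)

# A model scoring 91 on reasoning contributes 0.17 * 91 = 15.47 points
# toward its composite score under this (assumed) weighting.
score = overall_score({
    "reasoning": 91, "coding": 88, "math": 90, "knowledge": 85, "safety": 92,
})
```

The point of the weighting is that a strong reasoning result moves the composite by at most 17 points out of 100, so a model cannot lead the overall ranking on reasoning alone.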
BenchLM.ai tracks five complementary reasoning benchmarks: SimpleQA measures short-form factual accuracy; MuSR tests multi-step soft reasoning across paragraphs of context; BBH (BIG-Bench Hard) serves as a historical baseline; and LongBench v2 and MRCRv2 measure whether models can actually use the long context windows they advertise.
Reasoning performance is where the gap between "reasoning" and "non-reasoning" model architectures is most visible. Models with explicit chain-of-thought capabilities tend to outperform standard models by significant margins on MuSR and BBH, though at the cost of higher latency and token usage. See our reasoning rankings or compare specific models using our LLM selector quiz.
| Rank | Model | Provider | Access | Type | Context | Score | SimpleQA | MuSR | BBH | LongBench v2 | MRCRv2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 97% | 95% | 98% | 95% | 97% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 97% | 95% | 98% | 93% | 95% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 97% | 94% | 97% | 95% | 97% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 93% | 98% | 92% | 93% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 95% | 93% | 96% | 91% | 93% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 96% | 94% | 97% | 92% | 94% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 94% | 92% | 97% | 91% | 92% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 95% | 93% | 94% | 92% | 92% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 95% | 93% | 96% | 89% | 84% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 93% | 90% | 90% | 91% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 95% | 93% | 92% | 93% | 90% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 92% | 92% | 90% | 93% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 95% | 93% | 93% | 90% | 89% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 95% | 93% | 95% | 94% | 96% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 93% | 91% | 92% | 84% | 84% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 89% | 87% | 94% | 83% | 80% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 95% | 93% | 88% | 83% | 79% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 92% | 90% | 91% | 86% | 87% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 87% | 85% | 92% | 81% | 81% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 95% | 93% | 87% | 82% | 81% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 95% | 93% | 90% | 90% | 87% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 88% | 86% | 93% | 87% | 83% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 91% | 89% | 88% | 82% | 81% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 90% | 88% | 87% | 87% | 89% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 88% | 86% | 91% | 82% | 81% |
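The context-window column above mixes notations like 1.05M, 400K, and 2M. A small helper (hypothetical, not part of any BenchLM.ai tooling) can normalize these strings to token counts so the rows can be sorted or filtered programmatically:

```python
def parse_context(s: str) -> int:
    """Convert a context-window string like '1.05M' or '400K' to a token count."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = s[-1].upper()
    return int(float(s[:-1]) * multipliers[suffix])

# A few (model, context) pairs taken from the leaderboard above.
rows = [("GPT-5.4 Pro", "1.05M"), ("GPT-5.2 Pro", "400K"), ("Gemini 3 Pro", "2M")]

# Sort models by context window, largest first.
by_context = sorted(rows, key=lambda r: parse_context(r[1]), reverse=True)
```

Normalizing first avoids the classic pitfall of comparing the strings lexicographically, where "400K" would incorrectly sort above "1.05M".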
About Reasoning Benchmarks
SimpleQA: factual question answering benchmark