Reasoning Benchmarks

Logical reasoning and problem solving - Compare AI models across 2 logical reasoning benchmarks: SimpleQA and MuSR.

Reasoning Benchmark Results

Showing 25 of 52 models

Rank  Model               Creator    Type         Reasoning      Context  Score  Result 1  Result 2
1     GPT-5 (high)        OpenAI     Proprietary  Reasoning      128K     72     89%       87%
2     o1-preview          OpenAI     Proprietary  Reasoning      200K     71     88%       86%
3     GPT-5 (medium)      OpenAI     Proprietary  Reasoning      128K     70     87%       85%
4     Grok 4              xAI        Proprietary  Non-Reasoning  128K     69     83%       81%
5     GPT-5 mini          OpenAI     Proprietary  Reasoning      128K     68     84%       82%
6     o3-pro              OpenAI     Proprietary  Reasoning      200K     68     86%       84%
7     o3                  OpenAI     Proprietary  Reasoning      200K     67     84%       82%
8     Qwen2.5-1M          Alibaba    Open Weight  Non-Reasoning  1M       66     81%       79%
9     Qwen2.5-72B         Alibaba    Open Weight  Non-Reasoning  128K     65     80%       78%
10    o4-mini (high)      OpenAI     Proprietary  Non-Reasoning  200K     65     80%       78%
11    Gemini 2.5 Pro      Google     Proprietary  Non-Reasoning  2M       65     81%       79%
12    DeepSeek Coder 2.0  DeepSeek   Open Weight  Non-Reasoning  128K     64     78%       76%
13    DeepSeek LLM 2.0    DeepSeek   Open Weight  Non-Reasoning  128K     63     77%       75%
14    Claude 4.1 Opus     Anthropic  Proprietary  Non-Reasoning  200K     61     74%       72%
15    Claude 4 Sonnet     Anthropic  Proprietary  Non-Reasoning  200K     59     71%       69%
16    Llama 3.1 405B      Meta       Open Weight  Non-Reasoning  128K     58     68%       66%
17    Mistral Large 2     Mistral    Proprietary  Non-Reasoning  128K     57     66%       64%
18    GPT-4o              OpenAI     Proprietary  Non-Reasoning  128K     56     64%       62%
19    Claude 3.5 Sonnet   Anthropic  Proprietary  Non-Reasoning  200K     55     63%       61%
20    Gemini 1.5 Pro      Google     Proprietary  Non-Reasoning  2M       54     62%       60%
21    Mistral 8x7B        Mistral    Open Weight  Non-Reasoning  32K      52     63%       61%
22    Gemini 1.0 Pro      Google     Proprietary  Non-Reasoning  32K      52     60%       58%
23    Claude 3 Opus       Anthropic  Proprietary  Non-Reasoning  200K     51     59%       57%
24    GPT-4 Turbo         OpenAI     Proprietary  Non-Reasoning  128K     50     58%       56%
25    Llama 3 70B         Meta       Open Weight  Non-Reasoning  128K     48     56%       54%
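
Each row carries a creator, a type, and a reasoning flag, so the listing can be narrowed and re-ranked programmatically. The following is a minimal sketch, not the site's own implementation: the `Row` class, the `filter_rows` helper, and the field names (including `score` for the unlabeled integer column) are assumptions made for illustration, and only five rows from the listing are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    """One leaderboard entry. Field names are illustrative assumptions."""
    model: str
    creator: str
    license_type: str  # "Proprietary" or "Open Weight"
    reasoning: bool    # True for "Reasoning", False for "Non-Reasoning"
    context: str       # context window as listed, e.g. "128K"
    score: int         # the unlabeled integer column in the listing


# A small subset of the rows listed above.
ROWS = [
    Row("GPT-5 (high)", "OpenAI", "Proprietary", True, "128K", 72),
    Row("Grok 4", "xAI", "Proprietary", False, "128K", 69),
    Row("Qwen2.5-1M", "Alibaba", "Open Weight", False, "1M", 66),
    Row("Qwen2.5-72B", "Alibaba", "Open Weight", False, "128K", 65),
    Row("Llama 3.1 405B", "Meta", "Open Weight", False, "128K", 58),
]


def filter_rows(
    rows: List[Row],
    creator: Optional[str] = None,
    license_type: Optional[str] = None,
    reasoning: Optional[bool] = None,
    name: Optional[str] = None,
) -> List[Row]:
    """Keep rows matching every supplied criterion, highest score first."""
    kept = []
    for row in rows:
        if creator is not None and row.creator != creator:
            continue
        if license_type is not None and row.license_type != license_type:
            continue
        if reasoning is not None and row.reasoning != reasoning:
            continue
        # Name search is a case-insensitive substring match.
        if name is not None and name.lower() not in row.model.lower():
            continue
        kept.append(row)
    return sorted(kept, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for row in filter_rows(ROWS, license_type="Open Weight"):
        print(row.model, row.score)
```

Criteria left as `None` are ignored, so calling `filter_rows(ROWS)` with no filters simply returns the rows ordered by score.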

About Reasoning Benchmarks

SimpleQA

Factual question-answering benchmark

MuSR

Complex multi-step reasoning problems