Testing the Limits of Chain-of-thought with Multistep Soft Reasoning (MuSR)

A dataset for evaluating language models on multistep soft reasoning tasks specified in natural-language narratives. It tests the ability to perform complex, structured reasoning.

How BenchLM shows MuSR right now

BenchLM is tracking MuSR in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

114 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MuSR — April 20, 2026

BenchLM mirrors the published tracked score view for MuSR. GPT-5.2 Pro leads the public snapshot at 95%, followed by GPT-5.4 (94%) and GPT-5.3 Instant (94%). BenchLM does not use these results to rank models overall.

114 models · Reasoning · 20% of category score · Stale · Updated April 20, 2026

The published MuSR snapshot is tightly clustered at the top: GPT-5.2 Pro sits at 95%, the third row trails by just 1 point, and the full top-10 spread is 2 points, so many of the published scores sit in a narrow band.
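
Those gaps can be spot-checked directly against the tracked table below:

```python
# Spot-check of the clustering claims, using the top-10 scores copied
# from the tracked table below (GPT-5.2 Pro down to GPT-5.2).
top10 = [95, 94, 94, 93, 93, 93, 93, 93, 93, 93]

print(top10[0] - top10[2])      # leader vs. third row: 1 point
print(max(top10) - min(top10))  # top-10 spread: 2 points
```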

114 models have been evaluated on MuSR. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system, and MuSR contributes 20% of that category's score. While MuSR is currently flagged stale and shown for display only, those weights indicate how much a verified score here would matter to a model's overall ranking.
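
To make the weighting concrete, here is a minimal sketch of how the two figures compose, assuming a simple multiplicative scheme. The formula is an assumption about BenchLM's scoring, not a confirmed implementation; the methodology page is authoritative.

```python
# Illustrative sketch of how a benchmark's weight can compose into an
# overall score. The multiplicative scheme is an assumption, not
# BenchLM's confirmed formula.

REASONING_CATEGORY_WEIGHT = 0.17  # Reasoning category's share of the overall score
MUSR_WITHIN_CATEGORY = 0.20       # MuSR's share of the Reasoning category score

# Under a multiplicative scheme, MuSR's effective share of the overall score:
musr_overall_share = REASONING_CATEGORY_WEIGHT * MUSR_WITHIN_CATEGORY
print(f"MuSR's effective overall weight: {musr_overall_share:.1%}")  # 3.4%

# A 5-point swing on MuSR would then move the overall score by about:
print(f"Overall impact of a 5-point MuSR swing: {5 * musr_overall_share:.2f} points")
```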

About MuSR

Year: 2023
Tasks: Multi-step reasoning
Format: Narrative-based reasoning
Difficulty: Complex reasoning tasks

MuSR challenges models to perform multistep reasoning over complex narratives. Unlike simple factual questions, it requires models to track multiple entities, relationships, and logical steps across extended contexts.
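
For readers who want to inspect the underlying tasks, a minimal loading sketch follows. It assumes the dataset is published on the Hugging Face Hub as TAUR-Lab/MuSR (the authors' release) with a murder_mysteries split; the split and field names are assumptions and may need adjusting against the actual schema.

```python
# Minimal sketch: peek at a MuSR task. Assumes the Hugging Face dataset
# id "TAUR-Lab/MuSR" and its "murder_mysteries" split; the field names
# below are assumptions and may differ in the actual release.
from datasets import load_dataset

ds = load_dataset("TAUR-Lab/MuSR", split="murder_mysteries")
example = ds[0]

print(example["narrative"][:500])  # long natural-language story
print(example["question"])         # reasoning question over the story
print(example["choices"])          # candidate answers
```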

BenchLM freshness & provenance

Version: MuSR 2023
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
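
The policy itself lives on the methodology page, but the decision this paragraph describes is easy to express as code. The tiers and criteria below are illustrative assumptions, not BenchLM's published rules:

```python
# Hypothetical sketch of a freshness-based display policy. The tiers and
# criteria are illustrative assumptions, not BenchLM's published rules.
from enum import Enum

class DisplayTier(Enum):
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"

def classify(stale: bool, questions_public: bool, verified: bool) -> DisplayTier:
    """Map freshness/provenance metadata to a display tier."""
    if not verified:
        # Rows awaiting exact-source attachment are shown for reference only.
        return DisplayTier.DISPLAY_ONLY
    if stale or questions_public:
        # Static or fully public sets are prone to saturation/contamination.
        return DisplayTier.WATCH
    return DisplayTier.STRONG_DIFFERENTIATOR

# MuSR as described on this page: static (stale), public questions,
# verification records still being attached.
print(classify(stale=True, questions_public=True, verified=False).value)
# -> "display-only reference"
```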

Tracked score table (114 models)

1. GPT-5.2 Pro (gpt-5-2-pro): 95%
2. GPT-5.4 (gpt-5-4): 94%
3. GPT-5.3 Instant (gpt-5-3-instant): 94%
4. GPT-5.2-Codex (gpt-5-2-codex): 93%
5. GPT-5.3 Codex (gpt-5-3-codex): 93%
6. Grok 4.1 (grok-4-1): 93%
7. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 93%
8. Gemini 3.1 Pro (gemini-3-1-pro): 93%
9. Claude Opus 4.6 (claude-opus-4-6): 93%
10. GPT-5.2 (gpt-5-2): 93%
11. Claude Sonnet 4.6 (claude-sonnet-4-6): 93%
12. Gemini 3 Pro (gemini-3-pro): 93%
13. Claude Opus 4.5 (claude-opus-4-5): 93%
14. GPT-5.2 Instant (gpt-5-2-instant): 93%
15. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 92%
16. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 92%
17. GPT-5.1 (gpt-5-1): 91%
18. GLM-5 (Reasoning) (glm-5-reasoning): 90%
19. Claude Sonnet 4.5 (claude-sonnet-4-5): 89%
20. Grok 4.1 Fast (grok-4-1-fast): 88%
21. GPT-5 (high) (gpt-5-high): 87%
22. 86%
23. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 86%
24. GPT-5 (medium) (gpt-5-medium): 85%
25. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 85%
26. 84%
27. GLM-5.1 (glm-5-1): 82%
28. 82%
29. GLM-5 (glm-5): 82%
30. Step 3.5 Flash (step-3-5-flash): 82%
31. GPT-5 mini (gpt-5-mini): 82%
32. Mercury 2 (mercury-2): 82%
33. Grok 4 (grok-4): 81%
34. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 81%
35. GLM-4.7 (glm-4-7): 80%
36. Qwen2.5-1M (qwen2-5-1m): 79%
37. Gemini 2.5 Pro (gemini-2-5-pro): 79%
38. DeepSeek V3.2 (deepseek-v3-2): 79%
39. Qwen3.5 397B (qwen3-5-397b): 78%
40. Qwen2.5-72B (qwen2-5-72b): 78%
41. o4-mini (high) (o4-mini-high): 78%
42. DeepSeek Coder 2.0 (deepseek-coder-2-0): 76%
43. DeepSeekMath V2 (deepseekmath-v2): 75%
44. DeepSeek LLM 2.0 (deepseek-llm-2-0): 75%
45. MiMo-V2-Flash (mimo-v2-flash): 74%
46. Aion-2.0 (aion-2-0): 74%
47. Kimi K2.5 (kimi-k2-5): 72%
48. Claude 4.1 Opus (claude-4-1-opus): 72%
49. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 72%
50. Mistral Large 3 (mistral-large-3): 71%
51. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 70%
52. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 69%
53. Claude 4 Sonnet (claude-4-sonnet): 69%
54. Seed 1.6 (seed-1-6): 69%
55. MiniMax M2.5 (minimax-m2-5): 68%
56. Llama 3.1 405B (llama-3-1-405b): 66%
57. Seed-2.0-Lite (seed-2-0-lite): 66%
58. Gemini 3 Flash (gemini-3-flash): 65%
59. Mistral Large 2 (mistral-large-2): 64%
60. Ministral 3 14B (ministral-3-14b): 64%
61. Claude Haiku 4.5 (claude-haiku-4-5): 63%
62. GPT-4o (gpt-4o): 62%
63. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 62%
64. Claude 3.5 Sonnet (claude-3-5-sonnet): 61%
65. Mistral 8x7B (mistral-8x7b): 61%
66. GLM-4.7-Flash (glm-4-7-flash): 61%
67. Nemotron 3 Super 100B (nemotron-3-super-100b): 60%
68. Gemini 1.5 Pro (gemini-1-5-pro): 60%
69. Grok Code Fast 1 (grok-code-fast-1): 59%
70. Seed 1.6 Flash (seed-1-6-flash): 59%
71. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 58%
72. Gemini 1.0 Pro (gemini-1-0-pro): 58%
73. Claude 3 Opus (claude-3-opus): 57%
74. Seed-2.0-Mini (seed-2-0-mini): 57%
75. Ternary Bonsai 8B (ternary-bonsai-8b): 56.2%
76. GPT-4 Turbo (gpt-4-turbo): 56%
77. Llama 3 70B (llama-3-70b): 54%
78. Claude 3 Haiku (claude-3-haiku): 52%
79. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 52%
80. Ternary Bonsai 1.7B (ternary-bonsai-1-7b): 50.8%
81. 50%
82. Nemotron-4 15B (nemotron-4-15b): 50%
83. Moonshot v1 (moonshot-v1): 49%
84. Z-1 (z-1): 48%
85. GPT-OSS 120B (gpt-oss-120b): 47%
86. Gemini 2.5 Flash (gemini-2-5-flash): 46%
87. Ternary Bonsai 4B (ternary-bonsai-4b): 45.1%
88. 45.1%
89. Nemotron Ultra 253B (nemotron-ultra-253b): 45%
90. Llama 4 Behemoth (llama-4-behemoth): 44%
91. Llama 4 Scout (llama-4-scout): 43%
92. Llama 4 Maverick (llama-4-maverick): 42%
93. LFM2-24B-A2B (lfm2-24b-a2b): 42%
94. 41.4%
95. Gemma 3 27B (gemma-3-27b): 41%
96. DeepSeek-R1 (deepseek-r1): 40%
97. Grok 3 [Beta] (grok-3-beta): 38%
98. Nova Pro (nova-pro): 37%
99. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 36%
100. Qwen3 235B 2507 (qwen3-235b-2507): 35%
101. GLM-4.5 (glm-4-5): 33%
102. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 33%
103. MiniMax M1 80k (minimax-m1-80k): 32%
104. GLM-4.5-Air (glm-4-5-air): 31%
105. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 31%
106. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 30%
107. DeepSeek V3.1 (deepseek-v3-1): 29%
108. GPT-OSS 20B (gpt-oss-20b): 27%
109. Mistral 7B v0.3 (mistral-7b-v0-3): 26%
110. Ministral 3 8B (ministral-3-8b): 26%
111. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 26%
112. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 25%
113. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 22%
114. Ministral 3 3B (ministral-3-3b): 20%

FAQ

What does MuSR measure?

MuSR measures multistep soft reasoning over tasks specified in natural-language narratives, testing a model's ability to perform complex, structured reasoning.

Which model leads the published MuSR snapshot?

GPT-5.2 Pro currently leads the published MuSR snapshot with a tracked score of 95%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MuSR?

114 AI models are included in BenchLM's mirrored MuSR snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.