Testing the Limits of Chain-of-thought with Multistep Soft Reasoning (MuSR)

MuSR is a dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. It tests a model's ability to perform complex, structured reasoning.

About MuSR

Year: 2023
Tasks: Multi-step reasoning
Format: Narrative-based reasoning
Difficulty: Complex reasoning tasks

MuSR challenges models to perform multistep reasoning over complex narratives. Unlike simple factual questions, it requires models to track multiple entities, relationships, and logical steps across extended contexts.

Leaderboard (88 models)

#1 GPT-5.4: 93
#2 Gemini 3.1 Pro: 93
#3 Claude Opus 4.6: 93
#4 GPT-5.3 Codex: 93
#5 Grok 4.1: 93
#6 GPT-5.2: 93
#7 GPT-5.2-Codex: 93
#9 Claude Sonnet 4.6: 93
#10 Claude Opus 4.5: 93
#11 Gemini 3 Pro: 93
#13 GPT-5.1: 91
#14 GLM-5 (Reasoning): 90
#15 Claude Sonnet 4.5: 89
#17 GPT-5 (high): 87
#18 o1-preview: 86
#19 Kimi K2.5 (Reasoning): 86
#20 GPT-5 (medium): 85
#22 o3-pro: 84
#23 GPT-5 mini: 82
#24 o3: 82
#25 GLM-5: 82
#26 Grok 4: 81
#28 GLM-4.7: 80
#29 Qwen2.5-1M: 79
#30 Gemini 2.5 Pro: 79
#31 DeepSeek V3.2: 79
#32 Qwen2.5-72B: 78
#33 o4-mini (high): 78
#34 Qwen3.5 397B: 78
#35 DeepSeek Coder 2.0: 76
#36 DeepSeek LLM 2.0: 75
#37 DeepSeekMath V2: 75
#38 MiMo-V2-Flash: 74
#39 Claude 4.1 Opus: 72
#40 Kimi K2.5: 72
#41 Mistral Large 3: 71
#42 Claude 4 Sonnet: 69
#44 MiniMax M2.5: 68
#46 Gemini 3 Flash: 65
#47 Mistral Large 2: 64
#48 Claude Haiku 4.5: 63
#49 GPT-4o: 62
#50 Mistral 8x7B: 61
#51 Claude 3.5 Sonnet: 61
#52 GLM-4.7-Flash: 61
#53 Gemini 1.5 Pro: 60
#56 Gemini 1.0 Pro: 58
#58 Claude 3 Opus: 57
#59 GPT-4 Turbo: 56
#60 Llama 3 70B: 54
#61 Claude 3 Haiku: 52
#63 Nemotron-4 15B: 50
#64 Moonshot v1: 49
#65 Z-1: 48
#66 GPT-OSS 120B: 47
#67 Gemini 2.5 Flash: 46
#70 Llama 4 Scout: 43
#72 Gemma 3 27B: 41
#73 DeepSeek-R1: 40
#74 Qwen2.5-VL-32B: 39
#76 Nova Pro: 37
#78 Qwen3 235B 2507: 35
#80 GLM-4.5: 33
#81 MiniMax M1 80k: 32
#82 GLM-4.5-Air: 31
#84 DeepSeek V3.1: 29
#85 Kimi K2: 28
#86 GPT-OSS 20B: 27
#87 Mistral 7B v0.3: 26
#88 Mistral 8x7B v0.2: 25
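
Several leaderboard entries share the same score, so a single "top model" can be misleading. The following is a minimal Python sketch, not part of BenchLM, that groups a handful of rows copied from the listing by score so tied models are reported together. The `rows` sample and the helper name `group_by_score` are illustrative assumptions.

```python
from itertools import groupby

# A few (model, score) rows copied from the leaderboard; not the full list.
rows = [
    ("GPT-5.4", 93),
    ("Gemini 3.1 Pro", 93),
    ("Claude Opus 4.6", 93),
    ("GPT-5.1", 91),
    ("GLM-5 (Reasoning)", 90),
]


def group_by_score(rows):
    """Return [(score, [models, ...]), ...] ordered from highest score down.

    Hypothetical helper: sorted() is stable, so models with equal scores
    keep the order in which they were listed.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (score, [name for name, _ in group])
        for score, group in groupby(ordered, key=lambda row: row[1])
    ]


groups = group_by_score(rows)
top_score, top_models = groups[0]
print(f"Top score {top_score}: {', '.join(top_models)}")
```

Reporting the whole top group rather than the first row avoids implying an ordering among models that the displayed scores do not distinguish.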

FAQ

What does MuSR measure?

MuSR measures how well language models handle multistep soft reasoning tasks specified in natural language narratives, testing their ability to perform complex, structured reasoning.

Which model scores highest on MuSR?

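GPT-5.4 by OpenAI is currently listed first with a score of 93 on MuSR. At least nine other listed models share that displayed score, so the top position is effectively a tie.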

How many models are evaluated on MuSR?

BenchLM lists 88 AI models evaluated on MuSR.