Testing the Limits of Chain-of-thought with Multistep Soft Reasoning (MuSR)

A dataset for evaluating language models on multistep soft reasoning tasks specified in natural-language narratives. It tests the ability to perform complex, structured reasoning.

How BenchLM shows MuSR right now

BenchLM is tracking MuSR in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

114 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MuSR — April 20, 2026

BenchLM mirrors the published tracked score view for MuSR. GPT-5.2 Pro leads the public snapshot at 95%, followed by GPT-5.4 (94%) and GPT-5.3 Instant (94%). BenchLM does not use these results to rank models overall.

114 models · Reasoning · 20% of category score · Stale · Updated April 20, 2026

The published MuSR snapshot is tightly clustered at the top: GPT-5.2 Pro sits at 95%, the third row trails by just 1 point, and the full top-10 spread is 2 points, so many of the published scores sit in a narrow band.
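
Those gaps can be spot-checked directly against the tracked table below:

```python
# Spot-check of the clustering claims, using the top-10 scores copied
# from the tracked table below (GPT-5.2 Pro down to GPT-5.2).
top10 = [95, 94, 94, 93, 93, 93, 93, 93, 93, 93]

print(top10[0] - top10[2])      # leader vs. third row: 1 point
print(max(top10) - min(top10))  # top-10 spread: 2 points
```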

114 models have been evaluated on MuSR. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system, and MuSR contributes 20% of that category's score. While MuSR is currently flagged stale and shown for display only, those weights indicate how much a verified score here would matter to a model's overall ranking.
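
To make the weighting concrete, here is a minimal sketch of how the two figures compose, assuming a simple multiplicative scheme. The formula is an assumption about BenchLM's scoring, not a confirmed implementation; the methodology page is authoritative.

```python
# Illustrative sketch of how a benchmark's weight can compose into an
# overall score. The multiplicative scheme is an assumption, not
# BenchLM's confirmed formula.

REASONING_CATEGORY_WEIGHT = 0.17  # Reasoning category's share of the overall score
MUSR_WITHIN_CATEGORY = 0.20       # MuSR's share of the Reasoning category score

# Under a multiplicative scheme, MuSR's effective share of the overall score:
musr_overall_share = REASONING_CATEGORY_WEIGHT * MUSR_WITHIN_CATEGORY
print(f"MuSR's effective overall weight: {musr_overall_share:.1%}")  # 3.4%

# A 5-point swing on MuSR would then move the overall score by about:
print(f"Overall impact of a 5-point MuSR swing: {5 * musr_overall_share:.2f} points")
```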

About MuSR

Year: 2023
Tasks: Multi-step reasoning
Format: Narrative-based reasoning
Difficulty: Complex reasoning tasks

MuSR challenges models to perform multistep reasoning over complex narratives. Unlike simple factual questions, it requires models to track multiple entities, relationships, and logical steps across extended contexts.
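
For readers who want to inspect the underlying tasks, a minimal loading sketch follows. It assumes the dataset is published on the Hugging Face Hub as TAUR-Lab/MuSR (the authors' release) with a murder_mysteries split; the split and field names are assumptions and may need adjusting against the actual schema.

```python
# Minimal sketch: peek at a MuSR task. Assumes the Hugging Face dataset
# id "TAUR-Lab/MuSR" and its "murder_mysteries" split; the field names
# below are assumptions and may differ in the actual release.
from datasets import load_dataset

ds = load_dataset("TAUR-Lab/MuSR", split="murder_mysteries")
example = ds[0]

print(example["narrative"][:500])  # long natural-language story
print(example["question"])         # reasoning question over the story
print(example["choices"])          # candidate answers
```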

BenchLM freshness & provenance

Version: MuSR 2023
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
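
The policy itself lives on the methodology page, but the decision this paragraph describes is easy to express as code. The tiers and criteria below are illustrative assumptions, not BenchLM's published rules:

```python
# Hypothetical sketch of a freshness-based display policy. The tiers and
# criteria are illustrative assumptions, not BenchLM's published rules.
from enum import Enum

class DisplayTier(Enum):
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"

def classify(stale: bool, questions_public: bool, verified: bool) -> DisplayTier:
    """Map freshness/provenance metadata to a display tier."""
    if not verified:
        # Rows awaiting exact-source attachment are shown for reference only.
        return DisplayTier.DISPLAY_ONLY
    if stale or questions_public:
        # Static or fully public sets are prone to saturation/contamination.
        return DisplayTier.WATCH
    return DisplayTier.STRONG_DIFFERENTIATOR

# MuSR as described on this page: static (stale), public questions,
# verification records still being attached.
print(classify(stale=True, questions_public=True, verified=False).value)
# -> "display-only reference"
```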

Tracked score table (114 models)

1. GPT-5.2 Pro (gpt-5-2-pro): 95%
2. GPT-5.4 (gpt-5-4): 94%
3. GPT-5.3 Instant (gpt-5-3-instant): 94%
4. GPT-5.2-Codex (gpt-5-2-codex): 93%
5. GPT-5.3 Codex (gpt-5-3-codex): 93%
6. Grok 4.1 (grok-4-1): 93%
7. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 93%
8. Gemini 3.1 Pro (gemini-3-1-pro): 93%
9. Claude Opus 4.6 (claude-opus-4-6): 93%
10. GPT-5.2 (gpt-5-2): 93%
11. Claude Sonnet 4.6 (claude-sonnet-4-6): 93%
12. Gemini 3 Pro (gemini-3-pro): 93%
13. Claude Opus 4.5 (claude-opus-4-5): 93%
14. GPT-5.2 Instant (gpt-5-2-instant): 93%
15. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 92%
16. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 92%
17. GPT-5.1 (gpt-5-1): 91%
18. GLM-5 (Reasoning) (glm-5-reasoning): 90%
19. Claude Sonnet 4.5 (claude-sonnet-4-5): 89%
20. Grok 4.1 Fast (grok-4-1-fast): 88%
21. GPT-5 (high) (gpt-5-high): 87%
22. 86%
23. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 86%
24. GPT-5 (medium) (gpt-5-medium): 85%
25. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 85%
26. 84%
27. GLM-5.1 (glm-5-1): 82%
28. 82%
29. GLM-5 (glm-5): 82%
30. Step 3.5 Flash (step-3-5-flash): 82%
31. GPT-5 mini (gpt-5-mini): 82%
32. Mercury 2 (mercury-2): 82%
33. Grok 4 (grok-4): 81%
34. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 81%
35. GLM-4.7 (glm-4-7): 80%
36. Qwen2.5-1M (qwen2-5-1m): 79%
37. Gemini 2.5 Pro (gemini-2-5-pro): 79%
38. DeepSeek V3.2 (deepseek-v3-2): 79%
39. Qwen3.5 397B (qwen3-5-397b): 78%
40. Qwen2.5-72B (qwen2-5-72b): 78%
41. o4-mini (high) (o4-mini-high): 78%
42. DeepSeek Coder 2.0 (deepseek-coder-2-0): 76%
43. DeepSeekMath V2 (deepseekmath-v2): 75%
44. DeepSeek LLM 2.0 (deepseek-llm-2-0): 75%
45. MiMo-V2-Flash (mimo-v2-flash): 74%
46. Aion-2.0 (aion-2-0): 74%
47. Kimi K2.5 (kimi-k2-5): 72%
48. Claude 4.1 Opus (claude-4-1-opus): 72%
49. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 72%
50. Mistral Large 3 (mistral-large-3): 71%
51. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 70%
52. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 69%
53. Claude 4 Sonnet (claude-4-sonnet): 69%
54. Seed 1.6 (seed-1-6): 69%
55. MiniMax M2.5 (minimax-m2-5): 68%
56. Llama 3.1 405B (llama-3-1-405b): 66%
57. Seed-2.0-Lite (seed-2-0-lite): 66%
58. Gemini 3 Flash (gemini-3-flash): 65%
59. Mistral Large 2 (mistral-large-2): 64%
60. Ministral 3 14B (ministral-3-14b): 64%
61. Claude Haiku 4.5 (claude-haiku-4-5): 63%
62. GPT-4o (gpt-4o): 62%
63. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 62%
64. Claude 3.5 Sonnet (claude-3-5-sonnet): 61%
65. Mistral 8x7B (mistral-8x7b): 61%
66. GLM-4.7-Flash (glm-4-7-flash): 61%
67. Nemotron 3 Super 100B (nemotron-3-super-100b): 60%
68. Gemini 1.5 Pro (gemini-1-5-pro): 60%
69. Grok Code Fast 1 (grok-code-fast-1): 59%
70. Seed 1.6 Flash (seed-1-6-flash): 59%
71. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 58%
72. Gemini 1.0 Pro (gemini-1-0-pro): 58%
73. Claude 3 Opus (claude-3-opus): 57%
74. Seed-2.0-Mini (seed-2-0-mini): 57%
75. Ternary Bonsai 8B (ternary-bonsai-8b): 56.2%
76. GPT-4 Turbo (gpt-4-turbo): 56%
77. Llama 3 70B (llama-3-70b): 54%
78. Claude 3 Haiku (claude-3-haiku): 52%
79. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 52%
80. Ternary Bonsai 1.7B (ternary-bonsai-1-7b): 50.8%
81. 50%
82. Nemotron-4 15B (nemotron-4-15b): 50%
83. Moonshot v1 (moonshot-v1): 49%
84. Z-1 (z-1): 48%
85. GPT-OSS 120B (gpt-oss-120b): 47%
86. Gemini 2.5 Flash (gemini-2-5-flash): 46%
87. Ternary Bonsai 4B (ternary-bonsai-4b): 45.1%
88. 45.1%
89. Nemotron Ultra 253B (nemotron-ultra-253b): 45%
90. Llama 4 Behemoth (llama-4-behemoth): 44%
91. Llama 4 Scout (llama-4-scout): 43%
92. Llama 4 Maverick (llama-4-maverick): 42%
93. LFM2-24B-A2B (lfm2-24b-a2b): 42%
94. 41.4%
95. Gemma 3 27B (gemma-3-27b): 41%
96. DeepSeek-R1 (deepseek-r1): 40%
97. Grok 3 [Beta] (grok-3-beta): 38%
98. Nova Pro (nova-pro): 37%
99. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 36%
100. Qwen3 235B 2507 (qwen3-235b-2507): 35%
101. GLM-4.5 (glm-4-5): 33%
102. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 33%
103. MiniMax M1 80k (minimax-m1-80k): 32%
104. GLM-4.5-Air (glm-4-5-air): 31%
105. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 31%
106. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 30%
107. DeepSeek V3.1 (deepseek-v3-1): 29%
108. GPT-OSS 20B (gpt-oss-20b): 27%
109. Mistral 7B v0.3 (mistral-7b-v0-3): 26%
110. Ministral 3 8B (ministral-3-8b): 26%
111. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 26%
112. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 25%
113. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 22%
114. Ministral 3 3B (ministral-3-3b): 20%

FAQ

What does MuSR measure?

MuSR measures multistep soft reasoning over tasks specified in natural-language narratives, testing a model's ability to perform complex, structured reasoning.

Which model leads the published MuSR snapshot?

GPT-5.2 Pro currently leads the published MuSR snapshot with a tracked score of 95%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MuSR?

114 AI models are included in BenchLM's mirrored MuSR snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.