A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. It tests the ability to perform complex, structured reasoning.
Year: 2023
Tasks: Multi-step reasoning
Format: Narrative-based reasoning
Difficulty: Complex reasoning tasks
MuSR challenges models to perform multistep reasoning over complex narratives. Unlike simple factual questions, it requires models to track multiple entities, relationships, and logical steps across extended contexts.
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
GPT-5.4 by OpenAI currently leads with a score of 93 on MuSR.
On BenchLM, 88 AI models have been evaluated on MuSR.