A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.
Year: 2022
Tasks: 250 problems × 11 languages (the 10 translations plus the original English)
Format: Math word problems
Difficulty: Grade school math, multilingual
MGSM evaluates mathematical reasoning across languages. Performance can vary significantly from one language to another, with lower-resource languages (Bengali, Swahili, Telugu) typically showing the largest gaps.
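As a rough illustration of how per-language gaps like these might be measured, the sketch below computes accuracy per language and the difference from a reference language. It assumes exact-match scoring on the final numeric answer and uses English (`en`) as the reference. The function names, the record layout, and the toy records are all hypothetical. They are not taken from MGSM or BenchLM, and the numbers are not real results.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Compute accuracy per language.

    records: iterable of (language_code, predicted_answer, gold_answer).
    Scoring is exact match on the final answer (an assumed rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, gold in records:
        total[lang] += 1
        if predicted == gold:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_to_reference(accuracy, reference="en"):
    """Return reference accuracy minus each other language's accuracy."""
    ref = accuracy[reference]
    return {lang: ref - acc for lang, acc in accuracy.items() if lang != reference}


# Made-up toy records for illustration only, not real MGSM outputs.
toy_records = [
    ("en", 18, 18), ("en", 5, 5),
    ("sw", 18, 18), ("sw", 7, 5),
]

accuracy = per_language_accuracy(toy_records)
gaps = gap_to_reference(accuracy)
print(accuracy)
print(gaps)
```

A positive gap means the model scored lower in that language than in the reference language. A real evaluation would also need to extract the final answer from the model's free-text output, which this sketch leaves out.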
Language Models are Multilingual Chain-of-Thought Reasoners
GPT-5.3 Codex by OpenAI currently leads with a score of 96 on MGSM.
88 AI models have been evaluated on MGSM on BenchLM.