Multilingual Grade School Math (MGSM)

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

About MGSM

Year: 2022

Tasks: 250 problems × 11 languages (the English originals plus the 10 translations)

Format: Math word problems

Difficulty: Grade school math, multilingual

MGSM evaluates mathematical reasoning on the same problems in every language, so score differences between languages reflect language handling rather than problem difficulty. Lower-resource languages (Bengali, Swahili, Telugu) typically show the largest gaps.
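The per-language comparison described above can be sketched as follows. This is a minimal illustration, not BenchLM's scoring code: it assumes each prediction is scored by exact match on the final numeric answer and that the headline score is the unweighted average over languages. The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return percent accuracy per language.

    records: iterable of (language, predicted_answer, gold_answer) tuples.
    Scoring is exact match on the final answer (an assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, gold in records:
        total[lang] += 1
        if predicted == gold:
            correct[lang] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


def summarize(acc):
    """Return (unweighted average accuracy, gap between best and worst language)."""
    average = sum(acc.values()) / len(acc)
    gap = max(acc.values()) - min(acc.values())
    return average, gap


# Hypothetical toy data: the same 4 problems answered in 3 languages.
records = [
    ("en", 18, 18), ("en", 3, 3), ("en", 70, 70), ("en", 5, 5),
    ("de", 18, 18), ("de", 3, 3), ("de", 70, 70), ("de", 4, 5),
    ("sw", 18, 18), ("sw", 2, 3), ("sw", 70, 70), ("sw", 4, 5),
]

acc = accuracy_by_language(records)
average, gap = summarize(acc)
print(acc)      # per-language accuracy in percent
print(average)  # unweighted average over languages
print(gap)      # spread between the best and worst language
```

Because every language sees the same problems, the gap is a direct measure of how much accuracy a model loses when the language changes.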

Paper: Language Models are Multilingual Chain-of-Thought Reasoners (Shi et al., 2022)

Leaderboard (88 models)

#1 GPT-5.3 Codex: 96
#2 Claude Opus 4.6: 96
#3 Gemini 3.1 Pro: 96
#4 Grok 4.1: 96
#5 GPT-5.4: 95
#6 GPT-5.2: 95
#8 GPT-5.2-Codex: 91
#9 Claude Sonnet 4.6: 91
#10 Claude Sonnet 4.5: 91
#12 Claude Opus 4.5: 90
#13 o1-preview: 90
#14 GPT-5 (medium): 90
#16 GPT-5.1: 89
#17 GPT-5 (high): 89
#18 Gemini 3 Pro: 89
#19 GLM-5 (Reasoning): 89
#21 Kimi K2.5 (Reasoning): 88
#22 DeepSeekMath V2: 87
#23 Claude 4.1 Opus: 85
#24 Gemini 3 Flash: 85
#25 GLM-4.7-Flash: 85
#26 Claude 3.5 Sonnet: 85
#28 Grok 4: 84
#29 GLM-5: 84
#30 Qwen2.5-72B: 84
#31 DeepSeek V3.2: 84
#32 Gemini 2.5 Pro: 84
#33 Claude 4 Sonnet: 84
#34 MiniMax M2.5: 84
#37 o3-pro: 83
#38 o3: 83
#39 DeepSeek Coder 2.0: 83
#40 o4-mini (high): 83
#41 MiMo-V2-Flash: 83
#42 Kimi K2.5: 83
#43 GPT-5 mini: 82
#44 Qwen3.5 397B: 82
#45 DeepSeek LLM 2.0: 82
#46 Mistral Large 3: 82
#47 Claude Haiku 4.5: 82
#48 GPT-4o: 82
#49 GLM-4.7: 81
#50 Qwen2.5-1M: 81
#52 Mistral Large 2: 81
#53 Gemini 1.5 Pro: 76
#55 GPT-4 Turbo: 75
#56 Nemotron-4 15B: 75
#58 Mistral 8x7B: 74
#59 Z-1: 74
#61 Gemini 2.5 Flash: 74
#63 Claude 3 Opus: 73
#64 Claude 3 Haiku: 73
#65 Moonshot v1: 73
#66 Gemini 1.0 Pro: 72
#67 Llama 3 70B: 72
#68 GPT-OSS 120B: 72
#70 Gemma 3 27B: 64
#72 DeepSeek V3.1: 64
#73 Llama 4 Scout: 63
#75 Qwen2.5-VL-32B: 63
#76 Qwen3 235B 2507: 63
#77 MiniMax M1 80k: 63
#78 GLM-4.5-Air: 63
#80 Mistral 7B v0.3: 62
#81 Mistral 8x7B v0.2: 62
#82 DeepSeek-R1: 61
#83 Nova Pro: 61
#84 Kimi K2: 61
#85 GPT-OSS 20B: 61
#88 GLM-4.5: 60

FAQ

What does MGSM measure?

MGSM measures how well a model solves grade school math word problems when they are posed in different languages. It takes 250 problems from GSM8K and translates them into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

Which model scores highest on MGSM?

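GPT-5.3 Codex by OpenAI is currently ranked first with a score of 96 on MGSM. Claude Opus 4.6, Gemini 3.1 Pro, and Grok 4.1 are listed with the same rounded score of 96.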

How many models are evaluated on MGSM?

BenchLM lists 88 AI models evaluated on MGSM.