Multilingual Grade School Math (MGSM)

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.
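For readers who want the underlying data, the translated problems are commonly pulled from the Hugging Face Hub. The snippet below is a minimal sketch assuming the community mirror juletxara/mgsm with per-language configs and an answer_number field; verify the exact dataset ID and schema on the Hub before relying on it.

```python
from datasets import load_dataset

# Hypothetical dataset ID and field names: "juletxara/mgsm" is a commonly
# cited community mirror, but check the Hub for the exact ID and schema.
swahili = load_dataset("juletxara/mgsm", "sw", split="test")

print(len(swahili))                  # expected: 250 translated problems
print(swahili[0]["question"])        # the Swahili problem statement
print(swahili[0]["answer_number"])   # the gold numeric answer
```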

How BenchLM shows MGSM right now

BenchLM tracks MGSM in its local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the currently tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

111 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MGSM — April 21, 2026

BenchLM mirrors the published tracked-score view for MGSM. GPT-5.3 Codex leads the public snapshot at 96%, followed by GPT-5.4 and Grok 4.1, both also at 96%. BenchLM does not use these results to rank models overall.

111 models · Multilingual · 35% of category score · Stale · Updated April 21, 2026

The published MGSM snapshot is tightly clustered at the top: GPT-5.3 Codex sits at 96%, and the third-ranked model is tied with it. The top-10 spread is only 1.0 point, so the published scores sit in a narrow band.

111 models have been evaluated on MGSM. The benchmark falls in the Multilingual category, which carries a 7% weight in BenchLM.ai's overall scoring system. Within that category, MGSM contributes 35% of the category score, an effective weight of 0.35 × 7% = 2.45% of the overall score when it is counted. In its current display-only state, however, these tracked scores do not feed a model's overall ranking.
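As a quick check on those figures, the sketch below multiplies the two weights quoted above; both numbers come from this page, and the code is purely illustrative.

```python
# Effective overall-score weight of MGSM, from the figures quoted above:
# the Multilingual category carries 7% of the overall score, and MGSM
# contributes 35% of that category.
category_weight = 0.07   # Multilingual category's share of the overall score
benchmark_share = 0.35   # MGSM's share of the Multilingual category
effective_weight = category_weight * benchmark_share
print(f"MGSM effective weight: {effective_weight:.2%}")  # -> 2.45%
```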

About MGSM

Year: 2022
Tasks: 250 problems × 11 languages
Format: Math word problems
Difficulty: Grade school math, multilingual

MGSM evaluates mathematical reasoning across languages and reveals that performance can vary significantly by language, with lower-resource languages (Bengali, Swahili, Telugu) typically showing the largest gaps.
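MGSM does not ship a single official harness, but most reported scores use exact match on the final numeric answer. The sketch below illustrates that protocol together with a per-language accuracy breakdown; the extraction regex and the toy records are assumptions for illustration, not BenchLM's pipeline.

```python
import re
from collections import defaultdict

def extract_final_number(text: str):
    """Pull the last number out of a completion (a common MGSM heuristic)."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def per_language_accuracy(records):
    """records: iterable of (language, model_output, gold_answer) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for lang, output, gold in records:
        total[lang] += 1
        pred = extract_final_number(output)
        if pred is not None and abs(pred - float(gold)) < 1e-6:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy records, purely illustrative (language code, model output, gold answer).
records = [
    ("de", "Die Antwort ist 18.", 18),
    ("sw", "Jibu ni 11.", 12),
]
print(per_language_accuracy(records))  # {'de': 1.0, 'sw': 0.0}
```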

BenchLM freshness & provenance

Version: MGSM 2022
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
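As a rough illustration only, the sketch below shows how those three tiers could be derived from the freshness fields listed above; this is a hypothetical reading, not BenchLM's actual policy, which lives on the methodology page.

```python
def classify_benchmark(staleness: str, question_availability: str) -> str:
    """Hypothetical tiering from the freshness fields shown above; the real
    policy is documented on the BenchLM methodology page."""
    if staleness == "Fresh" and question_availability != "Public benchmark set":
        return "strong differentiator"
    if staleness == "Fresh":
        return "benchmark to watch"
    return "display-only reference"

# MGSM's metadata on this page (Stale, public questions) lands in the
# display-only tier, matching how the page presents it.
print(classify_benchmark("Stale", "Public benchmark set"))
```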

Tracked score table (111 models)

1. GPT-5.3 Codex (gpt-5-3-codex): 96%
2. GPT-5.4 (gpt-5-4): 96%
3. Grok 4.1 (grok-4-1): 96%
4. Gemini 3.1 Pro (gemini-3-1-pro): 96%
5. Claude Opus 4.6 (claude-opus-4-6): 96%
6. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 96%
7. GPT-5.2 Pro (gpt-5-2-pro): 96%
8. GPT-5.3 Instant (gpt-5-3-instant): 96%
9. GPT-5.2 (gpt-5-2): 95%
10. GPT-5.2 Instant (gpt-5-2-instant): 95%
11. GLM-4.7 (glm-4-7): 94%
12. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 94%
13. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 92%
14. GPT-5.2-Codex (gpt-5-2-codex): 91%
15. Claude Sonnet 4.6 (claude-sonnet-4-6): 91%
16. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 91%
17. Claude Sonnet 4.5 (claude-sonnet-4-5): 91%
18. GPT-5 (medium) (gpt-5-medium): 90%
19. 90%
20. Claude Opus 4.5 (claude-opus-4-5): 90%
21. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 89%
22. GPT-5 (high) (gpt-5-high): 89%
23. GLM-5 (Reasoning) (glm-5-reasoning): 89%
24. GPT-5.1 (gpt-5-1): 89%
25. Gemini 3 Pro (gemini-3-pro): 89%
26. Grok 4.1 Fast (grok-4-1-fast): 88%
27. Seed 1.6 (seed-1-6): 88%
28. DeepSeekMath V2 (deepseekmath-v2): 87%
29. GPT-4o mini (gpt-4o-mini): 87%
30. Seed-2.0-Lite (seed-2-0-lite): 87%
31. Step 3.5 Flash (step-3-5-flash): 86%
32. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 86%
33. Claude 4.1 Opus (claude-4-1-opus): 85%
34. Gemini 3 Flash (gemini-3-flash): 85%
35. Claude 3.5 Sonnet (claude-3-5-sonnet): 85%
36. GLM-4.7-Flash (glm-4-7-flash): 85%
37. GLM-5 (glm-5): 84%
38. Grok 4 (grok-4): 84%
39. Claude 4 Sonnet (claude-4-sonnet): 84%
40. Gemini 2.5 Pro (gemini-2-5-pro): 84%
41. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 84%
42. Qwen2.5-72B (qwen2-5-72b): 84%
43. DeepSeek V3.2 (deepseek-v3-2): 84%
44. Nemotron 3 Super 100B (nemotron-3-super-100b): 84%
45. Llama 3.1 405B (llama-3-1-405b): 84%
46. MiniMax M2.5 (minimax-m2-5): 84%
47. 83%
48. MiMo-V2-Flash (mimo-v2-flash): 83%
49. 83%
50. Kimi K2.5 (kimi-k2-5): 83%
51. DeepSeek Coder 2.0 (deepseek-coder-2-0): 83%
52. o4-mini (high) (o4-mini-high): 83%
53. Qwen3.5 397B (qwen3-5-397b): 82%
54. Claude Haiku 4.5 (claude-haiku-4-5): 82%
55. DeepSeek LLM 2.0 (deepseek-llm-2-0): 82%
56. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 82%
57. Mistral Large 3 (mistral-large-3): 82%
58. GPT-4o (gpt-4o): 82%
59. GPT-5 mini (gpt-5-mini): 82%
60. Qwen2.5-1M (qwen2-5-1m): 81%
61. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 81%
62. Mistral Large 2 (mistral-large-2): 81%
63. Mercury 2 (mercury-2): 81%
64. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 81%
65. Phi-4 (phi-4): 80.6%
66. Aion-2.0 (aion-2-0): 80%
67. Ministral 3 14B (ministral-3-14b): 80%
68. Gemini 1.5 Pro (gemini-1-5-pro): 76%
69. Seed 1.6 Flash (seed-1-6-flash): 76%
70. Grok Code Fast 1 (grok-code-fast-1): 75%
71. Nemotron-4 15B (nemotron-4-15b): 75%
72. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 75%
73. Seed-2.0-Mini (seed-2-0-mini): 75%
74. Gemini 2.5 Flash (gemini-2-5-flash): 74%
75. Nemotron Ultra 253B (nemotron-ultra-253b): 74%
76. Mistral 8x7B (mistral-8x7b): 74%
77. Z-1 (z-1): 74%
78. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 73%
79. Claude 3 Opus (claude-3-opus): 73%
80. Claude 3 Haiku (claude-3-haiku): 73%
81. Moonshot v1 (moonshot-v1): 73%
82. GPT-OSS 120B (gpt-oss-120b): 72%
83. Llama 3 70B (llama-3-70b): 72%
84. Gemini 1.0 Pro (gemini-1-0-pro): 72%
85. Llama 4 Behemoth (llama-4-behemoth): 66%
86. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 64%
87. Gemma 3 27B (gemma-3-27b): 64%
88. DeepSeek V3.1 (deepseek-v3-1): 64%
89. LFM2-24B-A2B (lfm2-24b-a2b): 64%
90. Qwen3 235B 2507 (qwen3-235b-2507): 63%
91. Llama 4 Maverick (llama-4-maverick): 63%
92. Llama 4 Scout (llama-4-scout): 63%
93. GLM-4.5-Air (glm-4-5-air): 63%
94. MiniMax M1 80k (minimax-m1-80k): 63%
95. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 63%
96. Ministral 3 8B (ministral-3-8b): 63%
97. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 62%
98. Mistral 7B v0.3 (mistral-7b-v0-3): 62%
99. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 62%
100. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 62%
101. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 62%
102. DeepSeek-R1 (deepseek-r1): 61%
103. GPT-OSS 20B (gpt-oss-20b): 61%
104. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 61%
105. Ministral 3 3B (ministral-3-3b): 61%
106. Grok 3 [Beta] (grok-3-beta): 60%
107. GLM-4.5 (glm-4-5): 60%
108. Granite-4.0-H-1B (granite-4-0-h-1b): 37.8%
109. Granite-4.0-1B (granite-4-0-1b): 27.5%
110. Granite-4.0-350M (granite-4-0-350m): 16.2%
111. Granite-4.0-H-350M (granite-4-0-h-350m): 14.7%

FAQ

What does MGSM measure?

MGSM measures grade school mathematical reasoning in a multilingual setting: it translates 250 problems from GSM8K into 10 typologically diverse languages (Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese).

Which model leads the published MGSM snapshot?

GPT-5.3 Codex currently leads the published MGSM snapshot with a tracked score of 96%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MGSM?

111 AI models are included in BenchLM's mirrored MGSM snapshot, based on the public leaderboard captured on April 21, 2026.

Last updated: April 21, 2026 · mirrored from the public benchmark leaderboard
