Multilingual Grade School Math (MGSM)

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

Top models on MGSM — June 2, 2026

As of June 2, 2026, DeepSeek V4 Flash Base leads the MGSM leaderboard with 85.7%, followed by DeepSeek V4 Pro Base (84.4%).

2 models · Multilingual · 35% of category score · Stale · Updated June 2, 2026

About MGSM

Year

2022

Tasks

250 problems × 11 languages (the English originals plus 10 translations)

Format

Math word problems

Difficulty

Grade school math, multilingual

MGSM evaluates mathematical reasoning on the same problems in each language, so scores are directly comparable. Performance can vary significantly from one language to another, with lower-resource languages (Bengali, Swahili, Telugu) typically showing the largest gaps relative to higher-resource ones.
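The per-language comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not BenchLM's scoring code: the function names, the `(language, is_correct)` input shape, the choice of English as the reference language, and the unweighted average across languages are all assumptions, and the toy outcomes are invented purely to show the arithmetic.

```python
from collections import defaultdict


def per_language_accuracy(results):
    """Return accuracy per language from (language_code, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, is_correct in results:
        totals[lang] += 1
        correct[lang] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def macro_average(acc_by_lang):
    """Unweighted mean of the per-language accuracies (an assumed aggregation)."""
    return sum(acc_by_lang.values()) / len(acc_by_lang)


def gap_to_reference(acc_by_lang, reference="en"):
    """Accuracy lost relative to the reference language, per language."""
    ref = acc_by_lang[reference]
    return {
        lang: ref - acc
        for lang, acc in acc_by_lang.items()
        if lang != reference
    }


# Invented toy outcomes (4 problems per language), not real MGSM results.
toy_results = (
    [("en", True)] * 4
    + [("de", True)] * 3 + [("de", False)]
    + [("sw", True)] * 2 + [("sw", False)] * 2
)

accuracy = per_language_accuracy(toy_results)
overall = macro_average(accuracy)
gaps = gap_to_reference(accuracy)

print(accuracy)
print(overall)
print(gaps)
```

On real data each language would contribute 250 graded answers, and the gap dictionary is where a drop on Bengali, Swahili, or Telugu would show up.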

BenchLM freshness & provenance

Version

MGSM 2022

Refresh cadence

Static

Staleness state

Stale

Question availability

Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (2 models)

1. DeepSeek V4 Flash Base: 85.7%
2. DeepSeek V4 Pro Base: 84.4%

FAQ

What does MGSM measure?

MGSM measures how well a model solves grade school math word problems in languages other than English. It translates 250 problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

Which model scores highest on MGSM?

DeepSeek V4 Flash Base by DeepSeek currently leads with a score of 85.7% on MGSM.

How many models are evaluated on MGSM?

Two AI models have been evaluated on MGSM on BenchLM.

Last updated: June 2, 2026 · BenchLM version MGSM 2022
