
Multilingual Benchmarks — MGSM & MMLU-ProX Leaderboard

Performance across multiple languages

Bottom line: Most frontier models perform well on multilingual tasks, but the gap between English and non-English performance varies significantly by provider.

MGSM · MMLU-ProX

Best Multilingual picks

BenchLM summaries for multilingual, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.

Top AI Models for Multilingual (April 2026)

As of April 2026, Claude Mythos Preview leads the provisional multilingual leaderboard with a weighted score of 100.0%, followed by Gemini 3.1 Pro (100.0%) and GPT-5.4 (100.0%). BenchLM is currently showing 101 provisional-ranked models and 9 verified-ranked models in this category.

What changed

Claude Mythos Preview leads multilingual with the most consistent cross-language scores.

GPT-5.4 is a close second, strong on MMLU-ProX across all tested languages.

Claude Opus 4.6 holds #3, with particularly strong MGSM performance.


Top models by benchmark

MGSM: grade school math problems translated into 10 diverse languages plus English (35% of category score).

Multilingual Leaderboard

Updated April 21, 2026

Sorted by multilingual weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

101 ranked models
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.
[Leaderboard table: 25 of 101 rows rendered on the page; most score columns did not survive extraction. Identifiable rows include GPT-5.4 (OpenAI, #3), GPT-5.2 (OpenAI, #7, 99%), and GPT-5.1 (OpenAI, #14, 85.5%).]

These rankings update weekly


Score in Context

What these scores mean

Multilingual carries a 7% weight in overall scoring. The weighted score blends MGSM (multilingual math reasoning) and MMLU-ProX (cross-language professional knowledge). This category reveals how well model capabilities transfer beyond English, where most training data is concentrated.
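As a concrete illustration, the blend described above can be sketched in a few lines. The 35%/65% split comes from the benchmark table on this page; the function and variable names are illustrative assumptions, not BenchLM's actual code.

```python
# Illustrative sketch of the multilingual category blend described above.
# Weights (MGSM 35%, MMLU-ProX 65%) come from the benchmark table on this
# page; names are hypothetical, not BenchLM's implementation.

CATEGORY_WEIGHT = 0.07  # multilingual's share of the overall score

BENCHMARK_WEIGHTS = {"MGSM": 0.35, "MMLU-ProX": 0.65}

def multilingual_score(scores: dict) -> float:
    """Blend per-benchmark scores (0-100) into one category score."""
    return sum(BENCHMARK_WEIGHTS[name] * scores[name] for name in BENCHMARK_WEIGHTS)

def overall_contribution(category_score: float) -> float:
    """Points this category contributes to the overall weighted score."""
    return CATEGORY_WEIGHT * category_score

# Example: a model scoring 92 on MGSM and 84 on MMLU-ProX.
cat = multilingual_score({"MGSM": 92.0, "MMLU-ProX": 84.0})  # 0.35*92 + 0.65*84 = 86.8
```

Because the category carries only a 7% weight, even a large multilingual gap moves a model's overall score by just a few points.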

Known limitations

Only two benchmarks cover this category, which limits the signal. MGSM tests math reasoning specifically, not general language quality. Languages tested are limited — low-resource languages remain untested. A model scoring well here may still struggle with less common languages or dialects.

How we weight

Multilingual carries a 7% weight in BenchLM.ai's overall scoring. Cross-language performance reveals how well model capabilities transfer beyond English. See the multilingual leaderboard or compare with knowledge benchmarks.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark   Weight   Status     Description
MGSM        35%      Weighted   Grade school math problems translated into 10 diverse languages plus English
MMLU-ProX   65%      Weighted   Broad multilingual professional benchmark across many languages


About Multilingual Benchmarks

This category combines MGSM (grade school math problems translated into 10 diverse languages plus English) and MMLU-ProX (a broad multilingual professional benchmark across many languages).
