Multilingual Benchmarks — MGSM & MMLU-ProX Leaderboard
Performance across multiple languages
Bottom line: Most frontier models perform well on multilingual tasks, but the gap between English and non-English performance varies significantly by provider.
MGSM · MMLU-ProX
Best Multilingual picks
BenchLM summaries for multilingual models, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.
Top AI Models for Multilingual — April 2026
As of April 2026, Claude Mythos Preview leads the provisional multilingual leaderboard with a weighted score of 100.0%, followed by Gemini 3.1 Pro (100.0%) and GPT-5.4 (100.0%). BenchLM is currently showing 101 provisional-ranked models and 9 verified-ranked models in this category.
- Claude Mythos Preview (Anthropic): Best cross-language consistency. Smallest gap between English and non-English performance.
- Gemini 3.1 Pro (Google)
- GPT-5.4 (OpenAI)
What changed
Claude Mythos Preview leads multilingual with the most consistent cross-language scores.
GPT-5.4 holds #3, strong on MMLU-ProX across all tested languages.
Claude Opus 4.6 holds #4, with particularly strong MGSM performance.
How to choose
- Non-English production deployment? Claude Mythos Preview: most consistent cross-language performance.
- Professional knowledge in multiple languages? GPT-5.4: best MMLU-ProX scores.
- Math reasoning in non-English? Claude Opus 4.6: top MGSM performance.
- Multilingual on a budget? Gemini 3.1 Pro: broad language support at low cost.
Top models by benchmark
MGSM: Grade school math problems translated into 10 diverse languages plus English (35% of category score)
Multilingual Leaderboard
Updated April 21, 2026. Sorted by multilingual weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | Multilingual score | Overall | MGSM | MMLU-ProX |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 100% | 99 | — | — |
| 2 | Gemini 3.1 Pro | Google | 100% | 94 | — | — |
| 3 | GPT-5.4 | OpenAI | 100% | 93 | — | — |
| 4 | Claude Opus 4.6 | Anthropic | 100% | 91 | — | — |
| 5 | GPT-5.3 Codex | OpenAI | 100% | Est. 89 | — | — |
| 6 | Grok 4.1 | xAI | 100% | Est. 80 | — | — |
| 7 | GPT-5.2 | OpenAI | 99% | 83 | — | — |
| 8 | Claude Sonnet 4.6 | Anthropic | 91.3% | 86 | — | — |
| 9 | Kimi K2.5 (Reasoning) | Moonshot AI | 90.4% | Est. 78 | — | — |
| 10 | GPT-5.2-Codex | OpenAI | 87.5% | Est. 79 | — | — |
| 11 | Claude Sonnet 4.5 | Anthropic | 87.5% | Est. 67 | — | — |
| 12 | GPT-5 (medium) | OpenAI | 86.5% | Est. 73 | — | — |
| 13 | Qwen3.5 397B (Reasoning) | Alibaba | 85.6% | Est. 80 | — | — |
| 14 | GPT-5.1 | OpenAI | 85.5% | Est. 80 | — | — |
| 15 | GPT-5.1-Codex-Max | OpenAI | 85.5% | Est. 78 | — | — |
| 16 | Gemini 3 Pro Deep Think | Google | 84.6% | Est. 86 | — | — |
| 17 | o1-preview | OpenAI | 84.6% | Est. 68 | — | — |
| 18 | Claude Opus 4.5 | Anthropic | 84% | 80 | — | 85.7% |
| 19 | | | 81.7% | Est. 84 | — | — |
| 20 | Gemini 3 Pro | Google | 81.7% | Est. 83 | — | — |
| 21 | GPT-5 (high) | OpenAI | 81.7% | Est. 79 | — | — |
| 22 | Qwen3.6 Plus | Alibaba | 81.5% | 76 | — | 84.7% |
| 23 | | | 76.9% | Est. 72 | — | — |
| 24 | Qwen3.5 397B | Alibaba | 74.3% | 65 | — | 84.7% |
| 25 | Qwen3.5-122B-A10B | Alibaba | 74.1% | 68 | — | 82.2% |
These rankings update weekly
Score in Context
What these scores mean
Multilingual carries a 7% weight in overall scoring. The weighted score blends MGSM (multilingual math reasoning) and MMLU-ProX (cross-language professional knowledge). This category reveals how well model capabilities transfer beyond English, where most training data is concentrated.
Known limitations
Only two benchmarks cover this category, which limits the signal. MGSM tests math reasoning specifically, not general language quality. Languages tested are limited — low-resource languages remain untested. A model scoring well here may still struggle with less common languages or dialects.
How we weight
Multilingual carries a 7% weight in BenchLM.ai's overall scoring. Cross-language performance reveals how well model capabilities transfer beyond English. See the multilingual leaderboard or compare with knowledge benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MGSM | 35% | Weighted | Grade school math problems translated into 10 diverse languages plus English |
| MMLU-ProX | 65% | Weighted | Broad multilingual professional benchmark across many languages |
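The blend behind the weighted score can be sketched as follows. The 35/65 weights come from the table above; the renormalization fallback when a benchmark is missing is an assumption based on the fallback behavior described in the notes, and the function name `multilingual_score` is hypothetical.

```python
def multilingual_score(mgsm=None, mmlu_prox=None):
    """Blend MGSM and MMLU-ProX into a category score.

    Weights follow the benchmark table (MGSM 35%, MMLU-ProX 65%).
    If one benchmark is missing, the remaining weights are
    renormalized rather than filled with synthetic values --
    this fallback is an assumption, not documented behavior.
    """
    weights = {"MGSM": 0.35, "MMLU-ProX": 0.65}
    scores = {"MGSM": mgsm, "MMLU-ProX": mmlu_prox}

    # Keep only benchmarks that actually have a trustworthy score.
    available = {k: v for k, v in scores.items() if v is not None}
    if not available:
        return None  # nothing to score

    total_weight = sum(weights[k] for k in available)
    return sum(weights[k] * available[k] for k in available) / total_weight
```

For example, a model with MGSM 80 and MMLU-ProX 85.7 would score 0.35 × 80 + 0.65 × 85.7 ≈ 83.7; a model with only MMLU-ProX reported would simply carry that score forward.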