Multilingual Benchmarks — MGSM & MMLU-ProX Leaderboard
Performance across multiple languages
Bottom line: Most frontier models perform well on multilingual tasks, but the gap between English and non-English performance varies significantly by provider.
MGSM · MMLU-ProX
Best Multilingual picks
BenchLM summaries for multilingual models, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.
Top AI Models for Multilingual — April 2026
As of April 2026, Claude Mythos Preview leads the provisional multilingual leaderboard with a weighted score of 100.0%, followed by Gemini 3.1 Pro (100.0%) and GPT-5.4 (100.0%). BenchLM is currently showing 101 provisional-ranked models and 9 verified-ranked models in this category.
- Claude Mythos Preview (Anthropic): Best cross-language consistency. Smallest gap between English and non-English performance.
- Gemini 3.1 Pro (Google)
- GPT-5.4 (OpenAI)
What changed
Claude Mythos Preview leads multilingual with the most consistent cross-language scores.
GPT-5.4 holds #3, strong on MMLU-ProX across all tested languages.
Claude Opus 4.6 holds #4, with particularly strong MGSM performance.
How to choose
- Non-English production deployment? Claude Mythos Preview: most consistent cross-language performance.
- Professional knowledge in multiple languages? GPT-5.4: best MMLU-ProX scores.
- Math reasoning in non-English? Claude Opus 4.6: top MGSM performance.
- Multilingual on a budget? Gemini 3.1 Pro: broad language support at low cost.
Top models by benchmark
MGSM: Grade school math problems translated into 10 diverse languages plus English (35% of category score)
Multilingual Leaderboard
Updated April 21, 2026. Sorted by multilingual weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | Multilingual score | Overall | MGSM | MMLU-ProX |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 100% | 99 | — | — |
| 2 | Gemini 3.1 Pro | Google | 100% | 94 | — | — |
| 3 | GPT-5.4 | OpenAI | 100% | 93 | — | — |
| 4 | Claude Opus 4.6 | Anthropic | 100% | 91 | — | — |
| 5 | GPT-5.3 Codex | OpenAI | 100% | Est. 89 | — | — |
| 6 | Grok 4.1 | xAI | 100% | Est. 80 | — | — |
| 7 | GPT-5.2 | OpenAI | 99% | 83 | — | — |
| 8 | Claude Sonnet 4.6 | Anthropic | 91.3% | 86 | — | — |
| 9 | Kimi K2.5 (Reasoning) | Moonshot AI | 90.4% | Est. 78 | — | — |
| 10 | GPT-5.2-Codex | OpenAI | 87.5% | Est. 79 | — | — |
| 11 | Claude Sonnet 4.5 | Anthropic | 87.5% | Est. 67 | — | — |
| 12 | GPT-5 (medium) | OpenAI | 86.5% | Est. 73 | — | — |
| 13 | Qwen3.5 397B (Reasoning) | Alibaba | 85.6% | Est. 80 | — | — |
| 14 | GPT-5.1 | OpenAI | 85.5% | Est. 80 | — | — |
| 15 | GPT-5.1-Codex-Max | OpenAI | 85.5% | Est. 78 | — | — |
| 16 | Gemini 3 Pro Deep Think | Google | 84.6% | Est. 86 | — | — |
| 17 | o1-preview | OpenAI | 84.6% | Est. 68 | — | — |
| 18 | Claude Opus 4.5 | Anthropic | 84% | 80 | — | 85.7% |
| 19 | | | 81.7% | Est. 84 | — | — |
| 20 | Gemini 3 Pro | Google | 81.7% | Est. 83 | — | — |
| 21 | GPT-5 (high) | OpenAI | 81.7% | Est. 79 | — | — |
| 22 | Qwen3.6 Plus | Alibaba | 81.5% | 76 | — | 84.7% |
| 23 | | | 76.9% | Est. 72 | — | — |
| 24 | Qwen3.5 397B | Alibaba | 74.3% | 65 | — | 84.7% |
| 25 | Qwen3.5-122B-A10B | Alibaba | 74.1% | 68 | — | 82.2% |
These rankings update weekly
Score in Context
What these scores mean
Multilingual carries a 7% weight in overall scoring. The weighted score blends MGSM (multilingual math reasoning) and MMLU-ProX (cross-language professional knowledge). This category reveals how well model capabilities transfer beyond English, where most training data is concentrated.
Known limitations
Only two benchmarks cover this category, which limits the signal. MGSM tests math reasoning specifically, not general language quality. Languages tested are limited — low-resource languages remain untested. A model scoring well here may still struggle with less common languages or dialects.
How we weight
Multilingual carries a 7% weight in BenchLM.ai's overall scoring. Cross-language performance reveals how well model capabilities transfer beyond English. See the multilingual leaderboard or compare with knowledge benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MGSM | 35% | Weighted | Grade school math problems translated into 10 diverse languages plus English |
| MMLU-ProX | 65% | Weighted | Broad multilingual professional benchmark across many languages |
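The blend behind the weighted score can be sketched as follows. The 35/65 weights come from the table above; the renormalization fallback when a benchmark is missing is an assumption based on the fallback behavior described in the notes, and the function name `multilingual_score` is hypothetical.

```python
def multilingual_score(mgsm=None, mmlu_prox=None):
    """Blend MGSM and MMLU-ProX into a category score.

    Weights follow the benchmark table (MGSM 35%, MMLU-ProX 65%).
    If one benchmark is missing, the remaining weights are
    renormalized rather than filled with synthetic values --
    this fallback is an assumption, not documented behavior.
    """
    weights = {"MGSM": 0.35, "MMLU-ProX": 0.65}
    scores = {"MGSM": mgsm, "MMLU-ProX": mmlu_prox}

    # Keep only benchmarks that actually have a trustworthy score.
    available = {k: v for k, v in scores.items() if v is not None}
    if not available:
        return None  # nothing to score

    total_weight = sum(weights[k] for k in available)
    return sum(weights[k] * available[k] for k in available) / total_weight
```

For example, a model with MGSM 80 and MMLU-ProX 85.7 would score 0.35 × 80 + 0.65 × 85.7 ≈ 83.7; a model with only MMLU-ProX reported would simply carry that score forward.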