MMLU-ProX

A multilingual extension of MMLU-Pro that evaluates professional-level academic knowledge and reasoning across many languages.

According to BenchLM.ai, GPT-5.4 Pro leads the MMLU-ProX benchmark with a score of 95, followed by GPT-5.4 (94) and Claude Opus 4.6 (94). The top models are clustered within 1 point, suggesting this benchmark is nearing saturation for frontier models.

121 models have been evaluated on MMLU-ProX. The benchmark falls under BenchLM.ai's multilingual category, which carries a 7% weight in the overall scoring system, so strong performance here directly improves a model's overall ranking.
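As a rough illustration, an overall score under a category-weighting scheme like this is just a weighted average of per-category scores. In the sketch below, only the 7% multilingual weight comes from the page; every other category name and weight is a made-up placeholder, not BenchLM.ai's actual methodology:

```python
def overall_score(category_scores, weights):
    """Weighted average of per-category scores; categories without a
    score are skipped and the remaining weights are renormalized."""
    covered = [c for c in weights if c in category_scores]
    total_w = sum(weights[c] for c in covered)
    return sum(category_scores[c] * weights[c] for c in covered) / total_w

# Hypothetical weights: only the 7% multilingual figure is documented.
weights = {"multilingual": 0.07, "reasoning": 0.53, "coding": 0.40}
scores = {"multilingual": 95, "reasoning": 91, "coding": 89}
print(round(overall_score(scores, weights), 1))
```

Renormalizing over the categories a model was actually scored on keeps the average comparable when some benchmarks are missing.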

About MMLU-ProX

Year: 2025
Tasks: Multilingual professional QA
Format: Multilingual multiple choice
Difficulty: Professional multilingual

MMLU-ProX expands multilingual evaluation beyond translated arithmetic, making it a better signal for broad cross-lingual reasoning and knowledge.
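Multilingual multiple-choice benchmarks like this reduce to exact-match accuracy on the chosen option, typically computed per language and then averaged. A minimal sketch of that bookkeeping (the `lang`/`gold`/`pred` field names are illustrative, not any official harness's schema):

```python
from collections import defaultdict

def accuracy_by_language(items):
    """Exact-match accuracy of predicted option letters, per language.
    Each item is a dict with 'lang', 'gold', and 'pred' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["lang"]] += 1
        correct[item["lang"]] += int(item["pred"] == item["gold"])
    return {lang: correct[lang] / total[lang] for lang in total}

items = [
    {"lang": "en", "gold": "C", "pred": "C"},
    {"lang": "en", "gold": "A", "pred": "B"},
    {"lang": "ja", "gold": "D", "pred": "D"},
]
print(accuracy_by_language(items))  # {'en': 0.5, 'ja': 1.0}
```

Reporting per-language accuracy before averaging makes it visible when a model's aggregate score hides weak performance in lower-resource languages.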

Leaderboard (121 models)

#1 GPT-5.4 Pro: 95
#2 GPT-5.4: 94
#3 Claude Opus 4.6: 94
#4 GPT-5.2 Instant: 94
#5 Gemini 3.1 Pro: 93
#6 GPT-5.2 Pro: 92
#7 GPT-5.3 Instant: 92
#8 GPT-5.3 Codex: 91
#9 GPT-5.2: 91
#10 Grok 4.1: 91
#12 Claude Sonnet 4.6: 89
#13 GPT-5.2-Codex: 87
#15 GPT-5.1: 87
#16 GPT-5 (medium): 87
#17 Claude Sonnet 4.5: 87
#18 o1-preview: 86
#19 Kimi K2.5 (Reasoning): 86
#22 GPT-5 (high): 85
#23 GLM-5 (Reasoning): 85
#24 Gemini 3 Pro: 85
#25 Claude Opus 4.5: 84
#27 Gemini 2.5 Pro: 82
#28 Step 3.5 Flash: 81
#29 GLM-5: 81
#30 Seed 1.6: 81
#31 Claude 4 Sonnet: 81
#32 DeepSeek V3.2: 81
#33 o4-mini (high): 81
#34 MiniMax M2.5: 81
#35 o3-pro: 80
#36 o3: 80
#37 DeepSeekMath V2: 80
#38 Qwen2.5-1M: 80
#39 Claude 4.1 Opus: 80
#40 Seed-2.0-Lite: 80
#41 GLM-4.7-Flash: 80
#43 GPT-5 mini: 79
#44 Grok 4: 79
#45 Mercury 2: 79
#47 Qwen2.5-72B: 79
#48 Claude Haiku 4.5: 79
#50 GLM-4.7: 78
#51 DeepSeek Coder 2.0: 78
#52 Gemini 3 Flash: 78
#53 Claude 3.5 Sonnet: 78
#54 Kimi K2.5: 78
#56 Mistral Large 2: 78
#57 o1: 77
#58 MiMo-V2-Flash: 77
#59 DeepSeek LLM 2.0: 77
#60 Qwen3.5 397B: 77
#61 Mistral Large 3: 77
#63 Aion-2.0: 77
#65 Ministral 3 14B: 75
#66 o3-mini: 73
#68 GPT-4.1 mini: 72
#69 GPT-4o: 72
#70 Z-1: 72
#71 Seed 1.6 Flash: 71
#72 Mistral 8x7B: 71
#73 Nemotron-4 15B: 71
#74 Seed-2.0-Mini: 70
#75 Claude 3 Haiku: 70
#76 GPT-OSS 120B: 70
#78 GPT-4.1: 69
#79 Gemini 2.5 Flash: 69
#81 GPT-4o mini: 68
#82 Claude 3 Opus: 68
#84 Moonshot v1: 68
#85 Gemini 1.5 Pro: 66
#86 GPT-4 Turbo: 65
#87 Llama 3 70B: 65
#88 Gemini 1.0 Pro: 64
#92 Ministral 3 8B: 61
#93 DeepSeek-R1: 60
#94 Phi-4: 60
#95 Gemma 3 27B: 60
#96 LFM2-24B-A2B: 60
#97 Nova Pro: 60
#99 Mistral 7B v0.3: 60
#100 LFM2.5-1.2B-Instruct: 60
#101 GPT-4.1 nano: 59
#102 Qwen2.5-VL-32B: 59
#103 Qwen3 235B 2507: 59
#104 DeepSeek V3.1: 59
#105 GPT-OSS 20B: 59
#106 Kimi K2: 59
#108 Ministral 3 3B: 59
#111 Llama 4 Scout: 58
#112 Grok 3 [Beta]: 58
#114 GLM-4.5: 57
#115 GLM-4.5-Air: 57
#116 MiniMax M1 80k: 57
#117 Mistral 8x7B v0.2: 57
#118 o1-pro: 52
#119 GPT-5 nano: 48
#120 DBRX Instruct: 46

FAQ

What does MMLU-ProX measure?

MMLU-ProX is a multilingual extension of MMLU-Pro that evaluates professional-level academic knowledge and reasoning across many languages.

Which model scores highest on MMLU-ProX?

GPT-5.4 Pro by OpenAI currently leads with a score of 95 on MMLU-ProX.

How many models are evaluated on MMLU-ProX?

121 AI models have been evaluated on MMLU-ProX on BenchLM.ai.

Last updated: March 12, 2026
