A multilingual extension of MMLU-Pro that brings professional-level academic evaluation to many languages.
As of April 29, 2026, Claude Opus 4.5 leads the MMLU-ProX leaderboard with 85.7%, followed by Qwen3.6 Plus (84.7%) and Qwen3.5 397B (84.7%).
1. Claude Opus 4.5 (Anthropic): 85.7%
2. Qwen3.6 Plus (Alibaba): 84.7%
3. Qwen3.5 397B (Alibaba): 84.7%
According to BenchLM.ai, the top three models are clustered within a single percentage point, suggesting this benchmark is nearing saturation for frontier models.
Nine models have been evaluated on MMLU-ProX. The benchmark falls under BenchLM.ai's Multilingual category, which carries a 7% weight in the overall scoring system. Within that category, MMLU-ProX contributes 65% of the category score, so strong performance here directly affects a model's overall ranking.
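To see how that weighting compounds, here is a minimal sketch assuming a simple weighted-average aggregation with the 7% category weight and 65% within-category share stated above (BenchLM's exact aggregation formula is not published here, so treat this as illustrative):

```python
# Sketch: effective contribution of MMLU-ProX to an overall BenchLM score,
# assuming straightforward multiplicative weighting. The 7% category weight
# and 65% benchmark share come from the page above; the real formula may differ.
CATEGORY_WEIGHT = 0.07   # Multilingual category weight in the overall score
BENCHMARK_SHARE = 0.65   # MMLU-ProX share within the Multilingual category

effective_weight = CATEGORY_WEIGHT * BENCHMARK_SHARE
print(f"Effective overall weight: {effective_weight:.4f}")  # 0.0455

# Under this assumption, a 1-point gain on MMLU-ProX moves a model's
# overall score by roughly 0.05 points.
```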
Year: 2025
Tasks: Multilingual professional QA
Format: Multilingual multiple choice
Difficulty: Professional multilingual
MMLU-ProX expands multilingual evaluation beyond translated arithmetic, making it a better signal for broad cross-lingual reasoning and knowledge.
Version: MMLU-ProX 2025
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
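One way to picture that freshness policy is a simple tiering rule over the metadata fields shown above. The sketch below is a hypothetical illustration: the function name, tier logic, and the "Aging" state are assumptions, not BenchLM's actual implementation.

```python
# Hypothetical sketch of a freshness-based tiering rule over the metadata
# fields listed above (refresh cadence, staleness state). Tier names match
# the three treatments described in the text; the thresholds are assumed.
def benchmark_tier(refresh_cadence: str, staleness_state: str) -> str:
    if staleness_state == "Current":
        return "strong differentiator"
    if staleness_state == "Aging" and refresh_cadence != "Static":
        return "benchmark to watch"
    return "display-only reference"

# MMLU-ProX is listed as Static / Current, so under this sketch it would
# still be treated as a strong differentiator.
print(benchmark_tier("Static", "Current"))
```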