MMLU and MMLU-Pro are two of the most widely cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.
MMLU is saturated — frontier models score 97-99% and the top 5 models are separated by just 2 points. MMLU-Pro fixes this with 10-choice questions and harder reasoning problems, creating a meaningful 85-91 spread that actually differentiates today's best models.
MMLU (Massive Multitask Language Understanding) has been the go-to knowledge benchmark since 2020. It tests models across 57 academic subjects with multiple-choice questions ranging from elementary to professional difficulty. But with frontier models now scoring 97-99%, it's lost its ability to separate the best from the rest.
MMLU-Pro was designed to fix this.
MMLU presents 4-choice multiple-choice questions across subjects like history, biology, computer science, law, and mathematics. A model reads a question and picks A, B, C, or D.
With 4 choices, random guessing gives you 25%. Early models struggled to beat 40-50%. Today's frontier models score 97-99%, meaning the benchmark is effectively saturated.
See current scores: MMLU leaderboard
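To make the format concrete, here's a minimal sketch of how a single 4-choice item can be prompted and scored. The sample question, the prompt wording, and the `query_model` helper are illustrative placeholders, not the official evaluation harness.

```python
# Minimal sketch of scoring one 4-choice MMLU-style item.
# The sample question and query_model() are illustrative placeholders,
# not the official MMLU evaluation code.

LETTERS = ["A", "B", "C", "D"]

item = {
    "question": "Which organelle is the primary site of ATP synthesis?",
    "choices": ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
    "answer": "B",
}

def format_prompt(item: dict) -> str:
    # Lay out the question, the lettered options, and an answer instruction.
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def query_model(prompt: str) -> str:
    # Placeholder: call your model API here and return its reply.
    return "B"

def is_correct(item: dict) -> bool:
    prediction = query_model(format_prompt(item)).strip().upper()[:1]
    return prediction == item["answer"]

print(format_prompt(item))
print("Correct:", is_correct(item))
```

Benchmark accuracy is just this check averaged over every question in every subject.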
MMLU-Pro makes three key improvements:
- Ten answer choices instead of four, which drops the random-guess baseline from 25% to 10% and cuts out lucky guesses.
- Harder, reasoning-focused questions, so models have to work through problems rather than recall facts.
- A cleaner question set, with trivial and noisy items removed.
This makes MMLU-Pro a much better benchmark for comparing frontier models. A 5-point gap on MMLU-Pro is more informative than a 1-point gap on MMLU.
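One way to see why the extra answer choices matter: under a toy model where a system either knows an answer outright or guesses uniformly at random among the options, the 4-choice format hands out far more free points. This is an illustrative simplification, not how either benchmark is officially scored, and it ignores MMLU-Pro's harder question mix.

```python
# Sketch: how random guessing inflates multiple-choice scores.
# Assumes a simplified model that either knows an answer outright
# or guesses uniformly at random among the k options.

def expected_score(known_fraction: float, num_choices: int) -> float:
    """Expected accuracy when unknown questions are answered by uniform guessing."""
    return known_fraction + (1 - known_fraction) / num_choices

for known in (0.50, 0.85, 0.97):
    mmlu = expected_score(known, num_choices=4)    # MMLU: 4 options, 25% guess floor
    pro = expected_score(known, num_choices=10)    # MMLU-Pro: 10 options, 10% guess floor
    print(f"knows {known:.0%} -> MMLU {mmlu:.1%}, MMLU-Pro {pro:.1%}")
```

With ten options a guess is worth far less, so the reported score tracks what the model actually knows more closely.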
See current scores: MMLU-Pro leaderboard
| Model | MMLU (%) | MMLU-Pro (%) |
|---|---|---|
| GPT-5.4 | 99 | 91 |
| GPT-5.3 Codex | 99 | 90 |
| Claude Opus 4.6 | 99 | 89 |
| GPT-5.2 | 98 | 87 |
| Gemini 3.1 Pro | 97 | 87 |
On MMLU, the top 5 models are within 2 points. On MMLU-Pro, the spread widens to 4 points. That's the difference between "all models are basically the same" and "there are real performance differences here."
If you're evaluating models in 2026:
- Use MMLU as a floor check: a model scoring below 90 is unlikely to be competitive on knowledge-intensive tasks, but a 97 vs 99 tells you little.
- Make MMLU-Pro your primary knowledge benchmark; its 85-91 spread actually separates frontier models.
- Reach for SuperGPQA (285 disciplines) or HLE (top models score 10-46%) when you need an even higher ceiling.
For a complete view, check our knowledge benchmark rankings or compare specific models on their benchmark detail pages.
→ See all models ranked on the full leaderboard
What is MMLU? MMLU (Massive Multitask Language Understanding) tests AI models across 57 academic subjects with 4-choice multiple-choice questions. Introduced in 2020, it covers history, biology, law, computer science, and mathematics. Frontier models now score 97-99%, making it effectively saturated for comparing the best models.
What is MMLU-Pro and how does it differ from MMLU? MMLU-Pro uses 10-choice questions instead of 4, includes harder reasoning-focused problems, and produces a wider score spread (85-91 for top models vs 97-99 for MMLU). The extra answer choices reduce lucky guessing and force models to reason rather than recall. It is the better benchmark for comparing frontier models in 2026.
Is MMLU still a useful benchmark in 2026? MMLU has limited value for comparing frontier models — the top 5 all score 97-99% with almost no signal. It remains useful as a floor check: a model scoring below 90 on MMLU is unlikely to be competitive for knowledge-intensive tasks. But MMLU-Pro and HLE are better discriminators.
Which model scores highest on MMLU-Pro? As of March 2026, GPT-5.4 leads MMLU-Pro at 91, followed by GPT-5.3 Codex (90) and Claude Opus 4.6 (89). See the MMLU-Pro leaderboard for current rankings.
What benchmark should I use instead of MMLU? Use MMLU-Pro for broad knowledge depth, SuperGPQA for coverage across 285 disciplines, or HLE for the hardest questions (top models score 10-46%). Any of these provides more signal than MMLU for today's frontier models.
All scores from BenchLM.ai. Last updated March 2026.