MMLU and MMLU-Pro are the most cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.
MMLU (Massive Multitask Language Understanding) has been the go-to knowledge benchmark since 2020. It tests models across 57 academic subjects with multiple-choice questions ranging from elementary to professional difficulty. But with frontier models now scoring 97-99%, it's lost its ability to separate the best from the rest.
MMLU-Pro was designed to fix this.
MMLU presents 4-choice multiple-choice questions across subjects like history, biology, computer science, law, and mathematics. A model reads a question and picks A, B, C, or D.
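To make the format concrete, here's a minimal sketch of how an MMLU item is typically rendered into a prompt. The field names (`question`, `choices`) follow the public Hugging Face release of the dataset, and the template is modeled on the original evaluation harness; treat both as assumptions rather than the one canonical format.

```python
# Sketch: render an MMLU-style item into a 4-choice prompt.
# Field names ("question", "choices") match the public Hugging Face
# release (cais/mmlu); the template is an assumption modeled on the
# original evaluation harness, not the only valid format.

LETTERS = "ABCD"

def format_mmlu_prompt(item: dict, subject: str) -> str:
    lines = [
        f"The following are multiple choice questions (with answers) about {subject}.",
        "",
        item["question"],
    ]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# Toy example:
item = {
    "question": "Which organelle carries out oxidative phosphorylation?",
    "choices": ["Ribosome", "Mitochondrion", "Golgi apparatus", "Nucleus"],
}
print(format_mmlu_prompt(item, "biology"))
```

The model's completion is then compared against the gold letter, which is what makes the benchmark cheap to score at scale.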
With 4 choices, random guessing gives you 25%. Early models struggled to beat 40-50%. Today's frontier models score 97-99%, meaning the benchmark is effectively saturated.
See current scores: MMLU leaderboard
MMLU-Pro makes three key improvements:

- **More answer choices.** Questions carry up to 10 options instead of 4, cutting the random-guessing floor from 25% to 10% (see the sketch after this list).
- **Harder questions.** Items are selected to require multi-step reasoning rather than pure recall, so strong models can no longer coast on memorization.
- **Cleaner data.** Trivial and noisy MMLU questions were filtered out and replaced with more difficult ones.
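To put the first improvement in numbers, here's a short sketch that loads both benchmarks and compares their guessing floors. The Hugging Face dataset IDs (`cais/mmlu`, `TIGER-Lab/MMLU-Pro`) and field names are assumptions based on the public releases; verify them against the hub before relying on them.

```python
# Sketch: compare the random-guessing floor on MMLU vs MMLU-Pro.
# Dataset IDs and field names are assumptions based on the public
# Hugging Face releases; check the hub before relying on them.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# MMLU items carry exactly 4 choices; MMLU-Pro items carry up to 10.
n_mmlu = len(mmlu[0]["choices"])     # -> 4
n_pro = len(mmlu_pro[0]["options"])  # typically 10

# Random guessing scores 1/k on a k-choice question, so the floor
# drops from 25% to 10%, leaving more room to separate strong models.
print(f"MMLU guess floor:     {1 / n_mmlu:.0%}")  # 25%
print(f"MMLU-Pro guess floor: {1 / n_pro:.0%}")   # 10%
```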
This makes MMLU-Pro a much better benchmark for comparing frontier models. A 5-point gap on MMLU-Pro is more informative than a 1-point gap on MMLU.
See current scores: MMLU-Pro leaderboard
| Model | MMLU (%) | MMLU-Pro (%) |
|---|---|---|
| GPT-5.4 | 99 | 91 |
| Claude Opus 4.6 | 99 | 89 |
| GPT-5.3 Codex | 99 | 90 |
| GPT-5.2 | 98 | 87 |
| Gemini 3.1 Pro | 97 | 87 |
On MMLU, the top 5 models are within 2 points. On MMLU-Pro, the spread widens to 4 points. That's the difference between "all models are basically the same" and "there are real performance differences here."
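The point is easy to verify from the table itself. The snippet below just recomputes the spread on each benchmark from the five rows above:

```python
# Recompute the top-5 spread on each benchmark from the table above.
scores = {
    "GPT-5.4":         {"mmlu": 99, "mmlu_pro": 91},
    "Claude Opus 4.6": {"mmlu": 99, "mmlu_pro": 89},
    "GPT-5.3 Codex":   {"mmlu": 99, "mmlu_pro": 90},
    "GPT-5.2":         {"mmlu": 98, "mmlu_pro": 87},
    "Gemini 3.1 Pro":  {"mmlu": 97, "mmlu_pro": 87},
}

for bench in ("mmlu", "mmlu_pro"):
    vals = [s[bench] for s in scores.values()]
    print(f"{bench}: spread = {max(vals) - min(vals)} points")
# mmlu: spread = 2 points
# mmlu_pro: spread = 4 points
```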
If you're evaluating models in 2026:

- Treat MMLU as a sanity check, not a differentiator: every frontier model clears 97%, so small gaps there tell you little.
- Use MMLU-Pro as your primary knowledge benchmark; its wider spread reflects real capability differences.
For a complete view, check our knowledge benchmark rankings or compare specific models on their benchmark detail pages.
All scores from BenchLM.ai. Last updated March 2026.