
MMLU vs MMLU-Pro: What Changed and Why It Matters

MMLU and MMLU-Pro are the most cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.

Glevd · March 7, 2026 · 7 min read

MMLU (Massive Multitask Language Understanding) has been the go-to knowledge benchmark since 2020. It tests models across 57 academic subjects with multiple-choice questions ranging from elementary to professional difficulty. But with frontier models now scoring 97-99%, it's lost its ability to separate the best from the rest.

MMLU-Pro was designed to fix this.

How MMLU works

MMLU presents 4-choice multiple-choice questions across subjects like history, biology, computer science, law, and mathematics. A model reads a question and picks A, B, C, or D.

With 4 choices, random guessing gives you 25%. Early models struggled to beat 40-50%. Today's frontier models score 97-99%, meaning the benchmark is effectively saturated.
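To make the mechanics concrete, here is a minimal sketch of an MMLU-style scoring loop in Python. Everything in it is illustrative: `ask_model` is a hypothetical stand-in for a real model API, and the prompt layout follows the common question-plus-lettered-choices format rather than any official evaluation harness.

```python
import random
import string

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a question in the usual layout: stem, lettered options, then 'Answer:'."""
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def ask_model(prompt: str, num_choices: int) -> str:
    """Hypothetical model call. This placeholder guesses uniformly at random,
    which reproduces the chance baseline: ~25% with 4 choices, ~10% with 10."""
    return string.ascii_uppercase[random.randrange(num_choices)]

def evaluate(items: list[dict]) -> float:
    """Score items shaped like {'question': str, 'choices': [str], 'answer': 'A'}
    by exact letter match."""
    correct = sum(
        ask_model(format_prompt(it["question"], it["choices"]), len(it["choices"])) == it["answer"]
        for it in items
    )
    return correct / len(items)
```

Swap the random guesser for a real model call and the same loop scores MMLU (4 choices) and MMLU-Pro (up to 10) identically; only the chance floor changes.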

See current scores: MMLU leaderboard

What MMLU-Pro changes

MMLU-Pro makes three key improvements:

  1. 10 answer choices instead of 4 — Random guessing drops from 25% to 10%, reducing the role of luck
  2. More reasoning-focused questions — Harder questions that require multi-step thinking, not just recall
  3. Better discrimination — Top model scores range from roughly 87-91 instead of 97-99, creating meaningful separation

This makes MMLU-Pro a much better benchmark for comparing frontier models. A 5-point gap on MMLU-Pro is more informative than a 1-point gap on MMLU.
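One way to see this is to rescale each score against its chance baseline, so that random guessing maps to 0 and a perfect run maps to 1. The sketch below uses the standard chance-corrected accuracy formula, purely as an illustration (neither benchmark reports this metric); the example numbers mirror the 1-point and 5-point gaps mentioned above.

```python
def chance_adjusted(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing scores 0 and a perfect run scores 1."""
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)

# A 1-point gap near MMLU's ceiling vs. a 5-point gap on MMLU-Pro:
mmlu_gap = chance_adjusted(0.99, 4) - chance_adjusted(0.98, 4)        # ~0.013
mmlu_pro_gap = chance_adjusted(0.91, 10) - chance_adjusted(0.86, 10)  # ~0.056
print(f"MMLU gap: {mmlu_gap:.3f}  MMLU-Pro gap: {mmlu_pro_gap:.3f}")
```

Even after removing the luck floor, the MMLU-Pro gap is about four times larger. That is the separation the table below makes visible.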

See current scores: MMLU-Pro leaderboard

Current rankings comparison

Model              MMLU   MMLU-Pro
GPT-5.4              99         91
Claude Opus 4.6      99         89
GPT-5.3 Codex        99         90
GPT-5.2              98         87
Gemini 3.1 Pro       97         87

On MMLU, the top 5 models are within 2 points. On MMLU-Pro, the spread widens to 4 points. That's the difference between "all models are basically the same" and "there are real performance differences here."

Which should you look at?

If you're evaluating models in 2026:

  • Use MMLU-Pro for comparing frontier models. It's harder and better at showing real differences.
  • Use MMLU as a baseline — a model scoring below 90 on MMLU may not be competitive for knowledge-intensive tasks.
  • Combine with HLE — Humanity's Last Exam is even harder and shows the largest spread among frontier models (scores from 10% to 46%)

For a complete view, check our knowledge benchmark rankings or compare specific models on their benchmark detail pages.


All scores from BenchLM.ai. Last updated March 2026.
