
Massive Multitask Language Understanding (MMLU)

A comprehensive multiple-choice question-answering test covering 57 tasks, including elementary mathematics, US history, computer science, law, and more. It tests knowledge across diverse academic subjects from high school to professional level.

Benchmark score on MMLU — April 20, 2026

BenchLM mirrors the published score view for MMLU. o1 leads the public snapshot at 91.8%, followed by GPT-4.1 (90.2%) and GPT-4.1 mini (87.5%). BenchLM does not use these results to rank models overall.

5 models · Knowledge · Stale · Saturated · Display only · Updated April 20, 2026

The published MMLU snapshot is tightly clustered at the top: o1 sits at 91.8%, while the third row is only 4.3 points behind. The full spread across all five evaluated models is 11.7 points, so the benchmark still separates strong models even when the leaders cluster.

5 models have been evaluated on MMLU. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. MMLU is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
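To make the weighting and exclusion concrete, here is a minimal sketch of how a category-weighted overall score could skip display-only benchmarks. The function name, data shapes, and the second weight are illustrative assumptions; only the 12% Knowledge weight and MMLU's display-only status come from this page.

```python
def overall_score(results, category_weights, display_only):
    """Weighted average of benchmark scores, skipping display-only ones.

    results: {benchmark: (category, score)}
    category_weights: {category: weight}, e.g. {"Knowledge": 0.12}
    display_only: set of benchmark names excluded from the formula
    """
    total, weight_sum = 0.0, 0.0
    for bench, (category, score) in results.items():
        if bench in display_only:
            continue  # e.g. MMLU: shown for reference, not scored
        w = category_weights.get(category, 0.0)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else None

# Illustrative data; "OtherBench" and the 0.20 Reasoning weight are made up.
results = {
    "MMLU": ("Knowledge", 91.8),       # display only -> ignored
    "OtherBench": ("Knowledge", 85.0),
    "SomeReasoning": ("Reasoning", 70.0),
}
weights = {"Knowledge": 0.12, "Reasoning": 0.20}
print(overall_score(results, weights, display_only={"MMLU"}))
```

Note that MMLU's 91.8% never enters the sum: removing it from `display_only` would change the result, which is exactly the "display only" distinction the page describes.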

About MMLU

Year: 2020
Tasks: 57 subjects
Format: Multiple-choice questions
Difficulty: Elementary to professional level

MMLU evaluates models on 57 subjects spanning humanities, social sciences, STEM, and other areas. Questions range from elementary to advanced professional level, making it a comprehensive test of world knowledge and reasoning ability.

BenchLM freshness & provenance

Version: MMLU
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
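The three-way decision described above can be sketched as a small lookup. The state names and rules here are assumptions inferred from the badges on this page ("Stale", "Saturated", "Display only"), not BenchLM's actual policy; see the methodology page for the real rules.

```python
def benchmark_treatment(staleness, saturated):
    """Map freshness metadata to how a benchmark is treated in scoring.

    staleness: hypothetical state string, e.g. "fresh", "aging", "stale"
    saturated: True if top models cluster near the score ceiling
    """
    if staleness == "stale" or saturated:
        return "display-only"   # reference value, excluded from scoring
    if staleness == "aging":
        return "watch"          # still scored, flagged for review
    return "differentiator"     # fresh and unsaturated: full weight

# MMLU's state on this page: stale and saturated -> display-only.
print(benchmark_treatment("stale", True))
```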

Benchmark score table (5 models)

1. o1: 91.8%
2. GPT-4.1: 90.2%
3. GPT-4.1 mini: 87.5%
4. 86.9%
5. 80.1%

FAQ

What does MMLU measure?

MMLU is a comprehensive multiple-choice question-answering test covering 57 tasks, including elementary mathematics, US history, computer science, law, and more. It measures knowledge across diverse academic subjects from high school to professional level.

Which model scores highest on MMLU?

o1 by OpenAI currently leads with a score of 91.8% on MMLU.

How many models are evaluated on MMLU?

5 AI models have been evaluated on MMLU on BenchLM.

Last updated: April 20, 2026 · Benchmark version: MMLU
