Massive Multitask Language Understanding (MMLU)

Name: Massive Multitask Language Understanding
Creator: BenchLM

A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.

Benchmark score on MMLU — June 2, 2026

BenchLM mirrors the published score view for MMLU. o1 leads the public snapshot at 91.8%, followed by GPT-4.1 (90.2%) and DeepSeek V4 Pro Base (90.1%). BenchLM does not use these results to rank models overall.

1. o1 (OpenAI, closed): 91.8%. Overall ~57. Context 200K.

2. GPT-4.1 (OpenAI, closed): 90.2%. Overall ~57. Context 1M.

3. DeepSeek V4 Pro Base (DeepSeek, open): 90.1%. Overall not listed. Context 1M.

8 models · Knowledge · Stale · Saturated · Display only · Updated June 2, 2026

The published MMLU snapshot is tightly clustered at the top: o1 sits at 91.8%, and third-place DeepSeek V4 Pro Base, at 90.1%, is only 1.7 points behind. The broader top-10 spread is 11.7 points, so the benchmark still separates strong models even when the leaders cluster.
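The gap figures above are plain subtraction in percentage points. A minimal sketch of that arithmetic, using only the three scores published in the snapshot (the function name `gap_to_leader` is illustrative and not part of BenchLM):

```python
# Top three scores as published in the MMLU snapshot above.
top_scores = {
    "o1": 91.8,
    "GPT-4.1": 90.2,
    "DeepSeek V4 Pro Base": 90.1,
}


def gap_to_leader(scores):
    """Return each model's distance from the best score, in percentage points."""
    best = max(scores.values())
    # Round to one decimal place to hide floating-point noise in the subtraction.
    return {name: round(best - value, 1) for name, value in scores.items()}


gaps = gap_to_leader(top_scores)
for name, gap in gaps.items():
    print(f"{name}: {gap} points behind the leader")
```

The 11.7-point top-10 spread cannot be recomputed the same way from this page, because only the top three rows are shown.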

8 models have been evaluated on MMLU. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. MMLU is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
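To illustrate what "display only" means in practice, here is a hedged sketch of a category score that skips display-only benchmarks. The 12% weight comes from the text above. Everything else is an assumption made for illustration: the `BenchmarkResult` type, the plain averaging rule, and the second benchmark named `HypotheticalKnowledgeBench`. BenchLM's actual formula is described on its methodology page, not here.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    category: str
    score: float        # percentage, 0 to 100
    display_only: bool  # True: shown on the page but left out of scoring


def category_average(results, category):
    """Mean score of the benchmarks in `category` that count toward scoring.

    Assumption: a simple unweighted mean. Returns None if nothing is scored.
    """
    scored = [
        r.score
        for r in results
        if r.category == category and not r.display_only
    ]
    return sum(scored) / len(scored) if scored else None


KNOWLEDGE_WEIGHT = 0.12  # the 12% category weight stated above

results = [
    # MMLU is display only, so its 91.8% does not enter the average.
    BenchmarkResult("MMLU", "Knowledge", 91.8, display_only=True),
    # Made-up benchmark and score, present only to give the average an input.
    BenchmarkResult("HypotheticalKnowledgeBench", "Knowledge", 70.0, display_only=False),
]

avg = category_average(results, "Knowledge")
contribution = KNOWLEDGE_WEIGHT * avg
print(f"Knowledge average: {avg}, weighted contribution: {contribution:.2f}")
```

In this sketch, changing the MMLU score has no effect on `contribution`, which mirrors the statement that MMLU does not directly affect overall rankings.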

About MMLU

Year: 2020
Tasks: 57 subjects
Format: Multiple choice questions
Difficulty: Elementary to professional level

MMLU evaluates models on 57 subjects spanning humanities, social sciences, STEM, and other areas. Questions range from elementary to advanced professional level, making it a comprehensive test of world knowledge and reasoning ability.
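Because every question is multiple choice, a score on this kind of benchmark reduces to accuracy: the fraction of questions where the chosen option matches the answer key. The sketch below computes per-subject accuracy plus two common aggregates. The toy records and subject labels are invented for illustration, and this page does not say which aggregate the published percentages use.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice predictions.

    records: iterable of (subject, predicted_choice, correct_choice) tuples.
    Returns (per-subject accuracy, micro average, macro average).
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        counts[subject][0] += int(predicted == gold)
        counts[subject][1] += 1

    subject_acc = {s: c / n for s, (c, n) in counts.items()}
    # Micro average: every question counts equally.
    micro = sum(c for c, _ in counts.values()) / sum(n for _, n in counts.values())
    # Macro average: every subject counts equally.
    macro = sum(subject_acc.values()) / len(subject_acc)
    return subject_acc, micro, macro


# Toy data, not real benchmark questions or model outputs.
toy_records = [
    ("us_history", "A", "A"),
    ("us_history", "B", "C"),
    ("computer_science", "D", "D"),
    ("computer_science", "A", "A"),
    ("computer_science", "C", "C"),
    ("computer_science", "B", "D"),
]

subject_acc, micro, macro = mmlu_accuracy(toy_records)
print(subject_acc, round(micro, 3), round(macro, 3))
```

The two aggregates differ whenever subjects have unequal question counts, which is worth checking before comparing scores from different sources.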

Paper: Measuring Massive Multitask Language Understanding

BenchLM freshness & provenance

Version: MMLU
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.