A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.
BenchLM mirrors the published scores for MMLU. o1 leads the public snapshot at 91.8%, followed by GPT-4.1 (90.2%) and GPT-4.1 mini (87.5%). BenchLM does not use these results to rank models overall.
o1 (OpenAI)
GPT-4.1 (OpenAI)
GPT-4.1 mini (OpenAI)
The published MMLU snapshot is tightly clustered at the top: o1 sits at 91.8%, while third-place GPT-4.1 mini (87.5%) is only 4.3 points behind. The broader top-10 spread is 11.7 points, so the benchmark still separates strong models even when the leaders cluster.
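The gap figures above are plain percentage-point differences, and a minimal sketch can reproduce them from the three scores quoted on this page. The helper names `ranked` and `gap_to_leader` are illustrative only and are not part of any BenchLM tooling. The 11.7-point spread is not reproduced because the remaining scores are not listed here.

```python
# Scores copied from the snapshot text on this page (percent accuracy on MMLU).
scores = {
    "o1": 91.8,
    "GPT-4.1": 90.2,
    "GPT-4.1 mini": 87.5,
}


def ranked(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def gap_to_leader(scores, rank):
    """Percentage-point gap between the leader and the model at 1-based `rank`.

    Hypothetical helper for illustration; rounds to one decimal place to
    avoid floating-point noise in the subtraction.
    """
    rows = ranked(scores)
    return round(rows[0][1] - rows[rank - 1][1], 1)


print(gap_to_leader(scores, 2))  # 1.6
print(gap_to_leader(scores, 3))  # 4.3
```

Differences are reported in percentage points rather than as relative percentages, which matches how the snapshot text describes the spread.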
Five models have been evaluated on MMLU, which falls in the Knowledge category. That category carries a 12% weight in BenchLM.ai's overall scoring system, but MMLU itself is currently displayed for reference only and excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2020
Tasks: 57 subjects
Format: Multiple choice questions
Difficulty: Elementary to professional level
MMLU evaluates models on 57 subjects spanning humanities, social sciences, STEM, and other areas. Questions range from elementary to advanced professional level, making it a comprehensive test of world knowledge and reasoning ability.
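Because every MMLU question is multiple choice, a score reduces to the fraction of questions answered correctly. The sketch below shows one way to compute per-subject and overall accuracy. The function name `mmlu_accuracy`, the record layout, and the sample data are assumptions made for illustration; this is not BenchLM's or any vendor's evaluation harness, and pooling all questions into one overall figure is only one possible aggregation.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Compute per-subject and overall accuracy.

    `records` is an iterable of (subject, predicted_letter, gold_letter)
    tuples. This layout is hypothetical and chosen for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare answer letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    # Overall accuracy pools every question, so larger subjects weigh more.
    overall = sum(correct.values()) / sum(total.values())
    return per_subject, overall


# Made-up answers for two subjects, purely to exercise the function.
sample = [
    ("us_history", "B", "B"),
    ("us_history", "c", "A"),
    ("computer_science", "D", "D"),
    ("computer_science", "A", "A"),
]
per_subject, overall = mmlu_accuracy(sample)
print(per_subject)
print(overall)
```

An unweighted mean of the per-subject accuracies would treat all 57 subjects equally instead, so published numbers may differ depending on which aggregation a given report uses.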
Version: MMLU
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
o1 by OpenAI currently leads with a score of 91.8% on MMLU.
5 AI models have been evaluated on MMLU on BenchLM.