Massive Multitask Language Understanding Professional (MMLU-Pro)

Name: Massive Multitask Language Understanding Professional
Creator: BenchLM

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Top models on MMLU-Pro — May 13, 2026

As of May 13, 2026, Claude Opus 4.5 leads the MMLU-Pro leaderboard with 89.5% , followed by Qwen3.6 Plus (88.5%) and Qwen3.5 397B (87.8%).

1Closed

Claude Opus 4.5

Anthropic

89.5%

Overall 77Context 200K

2Closed

Qwen3.6 Plus

Alibaba

88.5%

Overall 73Context 1M

3Open

Qwen3.5 397B

Alibaba

87.8%

Overall 64Context 128K

34 modelsKnowledge22% of category scoreRefreshingUpdated May 13, 2026

According to BenchLM.ai, Claude Opus 4.5 leads the MMLU-Pro benchmark with a score of 89.5%, followed by Qwen3.6 Plus (88.5%) and Qwen3.5 397B (87.8%). The top models are clustered within 1.7 points, suggesting this benchmark is nearing saturation for frontier models.

34 models have been evaluated on MMLU-Pro. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, MMLU-Pro contributes 22% of the category score, so strong performance here directly affects a model's overall ranking.

About MMLU-Pro

Year

2024

Tasks

Multiple subjects

Format

10-way multiple choice

Difficulty

Professional level

MMLU-Pro increases the number of choices from 4 to 10 and integrates more reasoning-focused problems, reducing the chance of correct guessing and better evaluating true understanding. It serves as a more robust discriminator of model capabilities.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

BenchLM freshness & provenance

Version

MMLU-Pro

Refresh cadence

Static

Staleness state

Refreshing

Question availability

Public benchmark set

Refreshing

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (34 models)

Claude Opus 4.5

AnthropicClosed

89.5%

Qwen3.6 Plus

AlibabaClosed

88.5%

Qwen3.5 397B

AlibabaOpen

87.8%

DeepSeek V4 Pro (Max)

DeepSeekOpen

87.5%

DeepSeek V4 Pro (High)

DeepSeekOpen

87.1%

Kimi K2.5 (Reasoning)

Moonshot AIClosed

87.1%

Kimi K2.5

Moonshot AIOpen

87.1%

Qwen3.5-122B-A10B

AlibabaOpen

86.7%

DeepSeek V4 Flash (High)

DeepSeekOpen

86.4%

DeepSeek V4 Flash (Max)

DeepSeekOpen

86.2%

Qwen3.6-27B

AlibabaOpen

86.2%

Qwen3.5-27B

AlibabaOpen

86.1%

GLM-5

Z.AIOpen

85.7%

Qwen3.5-35B-A3B

AlibabaOpen

85.3%

Qwen3.6-35B-A3B

AlibabaOpen

85.2%

Gemma 4 31B

GoogleOpen

85.2%

MiMo-V2-Flash

XiaomiOpen

84.9%

GLM-4.7

Z.AIOpen

84.3%

DeepSeek V4 Flash

DeepSeekOpen

83%

Qwen3 235B 2507

AlibabaOpen

83%

DeepSeek V4 Pro

DeepSeekOpen

82.9%

Gemma 4 26B A4B

GoogleOpen

82.6%

Claude Opus 4.6

AnthropicClosed

82%

Exaone 4.0 32B

LG AI ResearchOpen

81.8%

Claude Sonnet 4.6

AnthropicClosed

79.2%

Nemotron 3 Nano Omni 30B A3B

NVIDIAOpen

77.3%

DeepSeek V3

DeepSeekOpen

75.9%

ZAYA1-8B

ZyphraOpen

74.2%

DeepSeek V4 Pro Base

DeepSeekOpen

73.5%

Gemma 4 E4B

GoogleOpen

69.4%

DeepSeek V4 Flash Base

DeepSeekOpen

68.3%

ZAYA1-74B-Preview

ZyphraOpen

68.1%

Gemma 4 E2B

GoogleOpen

60%

LFM2.5-VL-450M

LiquidAIOpen

19.3%

FAQ

What does MMLU-Pro measure?

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Which model scores highest on MMLU-Pro?

Claude Opus 4.5 by Anthropic currently leads with a score of 89.5% on MMLU-Pro.

How many models are evaluated on MMLU-Pro?

34 AI models have been evaluated on MMLU-Pro on BenchLM.

Compare Top Models on MMLU-Pro

Claude Opus 4.5 vs Qwen3.6 Plus Qwen3.6 Plus vs Qwen3.5 397B Qwen3.5 397B vs DeepSeek V4 Pro (Max)DeepSeek V4 Pro (Max) vs DeepSeek V4 Pro (High)

Learn More

Read our explainer: MMLU-Pro benchmark deep dive

Last updated: May 13, 2026 · BenchLM version MMLU-Pro

The AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.