Benchmark profile

Massive Multitask Language Understanding Professional (MMLU-Pro)

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Data verified July 20, 2026

Top models on MMLU-Pro — July 20, 2026

As of July 20, 2026, Qwen3.7 Max leads the MMLU-Pro leaderboard with 89.6%, followed by Claude Opus 4.5 (89.5%) and Qwen3.7 Plus (88.5%).

1. Qwen3.7 Max · Alibaba · Closed
Model ID: qwen3-7-max
MMLU-Pro: 89.6%
Overall: 72.84 · Context: 1M

2. Claude Opus 4.5 · Anthropic · Closed
Model ID: claude-opus-4-5
MMLU-Pro: 89.5%
Overall: 64.22 · Context: 200K

3. Qwen3.7 Plus · Alibaba · Closed
Model ID: qwen3-7-plus
MMLU-Pro: 88.5%
Overall: 67.22 · Context: 1M

42 models · Knowledge · 30% of category score · Refreshing · Updated July 20, 2026

Leaderboard (42 models)

Model · Developer · Access · Score
Qwen3.7 Max · Alibaba · Closed · 89.6%
Claude Opus 4.5 · Anthropic · Closed · 89.5%
Qwen3.7 Plus · Alibaba · Closed · 88.5%
Qwen3.6 Plus · Alibaba · Closed · 88.5%
Qwen3.5 397B · Alibaba · Open weight · 87.8%
DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 87.5%
DeepSeek V4 Pro (High) · DeepSeek · Open weight · 87.1%
Kimi K2.5 · Moonshot AI · Open weight · 87.1%
Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 87.1%
Nemotron 3 Ultra · NVIDIA · Open weight · 86.8%
Qwen3.5-122B-A10B · Alibaba · Open weight · 86.7%
DeepSeek V4 Flash (High) · DeepSeek · Open weight · 86.4%
DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 86.2%
Qwen3.6-27B · Alibaba · Open weight · 86.2%
Qwen3.5-27B · Alibaba · Open weight · 86.1%
GLM-5 · Z.AI · Open weight · 85.7%
Qwen3.5-35B-A3B · Alibaba · Open weight · 85.3%
Qwen3.6-35B-A3B · Alibaba · Open weight · 85.2%
Gemma 4 31B · Google · Open weight · 85.2%
MAI-Thinking-1 · Microsoft · Closed · 85%
MiMo-V2-Flash · Xiaomi · Open weight · 84.9%
GLM-4.7 · Z.AI · Open weight · 84.3%
Qwen3 235B 2507 · Alibaba · Open weight · 83%
DeepSeek V4 Flash · DeepSeek · Open weight · 83%
DeepSeek V4 Pro · DeepSeek · Open weight · 82.9%
Gemma 4 26B A4B · Google · Open weight · 82.6%
Claude Opus 4.6 · Anthropic · Closed · 82%
Exaone 4.0 32B · LG AI Research · Open weight · 81.8%
Claude Sonnet 4.6 · Anthropic · Closed · 79.2%
Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 77.3%
Gemma 4 12B · Google · Open weight · 77.2%
DeepSeek V3 · DeepSeek · Open weight · 75.9%
ZAYA1-8B · Zyphra · Open weight · 74.2%
DeepSeek V4 Pro Base · DeepSeek · Open weight · 73.5%
Gemma 4 E4B · Google · Open weight · 69.4%
DeepSeek V4 Flash Base · DeepSeek · Open weight · 68.3%
ZAYA1-74B-Preview · Zyphra · Open weight · 68.1%
Gemma 4 E2B · Google · Open weight · 60%
Soofi S 30B-A3B · Soofi Project · Open weight · 51.4%
MiniCPM5-1B · OpenBMB · Open weight · 48.9%
LFM2.5-230M · LiquidAI · Open weight · 20.3%
LFM2.5-VL-450M · LiquidAI · Open weight · 19.3%

According to BenchLM.ai, Qwen3.7 Max leads the MMLU-Pro benchmark with a score of 89.6%, followed by Claude Opus 4.5 (89.5%) and Qwen3.7 Plus (88.5%). The top models are clustered within 1.1 points, suggesting this benchmark is nearing saturation for frontier models.
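The 1.1-point figure is the gap between the first- and third-place scores quoted above. A minimal Python sketch of that arithmetic, using only those three scores (the variable names are illustrative and not part of any BenchLM tooling):

```python
# Top-three MMLU-Pro scores as quoted in the text (percent).
top_scores = {
    "Qwen3.7 Max": 89.6,
    "Claude Opus 4.5": 89.5,
    "Qwen3.7 Plus": 88.5,
}

# Gap between the highest and lowest of the three, rounded to one
# decimal place to hide floating-point noise.
spread = round(max(top_scores.values()) - min(top_scores.values()), 1)
print(f"Top-3 spread: {spread} points")
```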

42 models have been evaluated on MMLU-Pro. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, MMLU-Pro contributes 30% of the category score, so strong performance here directly affects a model's overall ranking.
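To make the weighting concrete: if the category weight and the within-category share simply multiply (an assumption here; the BenchLM methodology page is the authority on how scores are actually combined), MMLU-Pro would account for roughly 12% × 30% = 3.6% of the overall score. A small sketch of that calculation, with illustrative variable names:

```python
# Weights quoted in the text.
category_weight = 0.12   # Knowledge category's share of the overall score
benchmark_share = 0.30   # MMLU-Pro's share of the Knowledge category score

# Assumption: the overall score is a nested weighted average, so the
# two weights multiply to give the benchmark's effective weight.
effective_weight = category_weight * benchmark_share

print(f"Effective weight of MMLU-Pro: {effective_weight:.1%}")
```

Under that same assumption, a 1-point gain on MMLU-Pro would move a model's overall score by about 0.036 points.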

About MMLU-Pro

Year: 2024

Tasks: Multiple subjects

Format: 10-way multiple choice

Difficulty: Professional level

MMLU-Pro increases the number of choices from 4 to 10 and integrates more reasoning-focused problems, reducing the chance of correct guessing and better evaluating true understanding. It serves as a more robust discriminator of model capabilities.
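The effect on guessing can be sketched with a one-line baseline. The snippet below assumes every question offers exactly the stated number of options and that a guesser picks uniformly at random; `chance_accuracy` is a hypothetical helper written for illustration, not something taken from the benchmark's code.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


mmlu_baseline = chance_accuracy(4)        # original MMLU: 4 options
mmlu_pro_baseline = chance_accuracy(10)   # MMLU-Pro: 10 options

print(f"MMLU chance level:     {mmlu_baseline:.0%}")
print(f"MMLU-Pro chance level: {mmlu_pro_baseline:.0%}")
```

Under these assumptions the guessing floor drops from 25% to 10%, which is one way to read the scores at the bottom of the leaderboard.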

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

BenchLM freshness & provenance

Version: MMLU-Pro

Refresh cadence: Static

Staleness state: Refreshing

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does MMLU-Pro measure?

MMLU-Pro measures multitask language understanding. It is an enhanced version of MMLU with 10 answer choices instead of 4 and more reasoning-focused questions, which better differentiates frontier models.

Which model scores highest on MMLU-Pro?

Qwen3.7 Max by Alibaba currently leads with a score of 89.6% on MMLU-Pro.

How many models are evaluated on MMLU-Pro?

42 AI models have been evaluated on MMLU-Pro on BenchLM.

Compare Top Models on MMLU-Pro

Qwen3.7 Max vs Claude Opus 4.5
Claude Opus 4.5 vs Qwen3.7 Plus
Qwen3.7 Plus vs Qwen3.6 Plus
Qwen3.6 Plus vs Qwen3.5 397B

Learn More

Read our explainer: MMLU-Pro benchmark deep dive

Last updated: July 20, 2026 · BenchLM version MMLU-Pro
