Massive Multi-discipline Multimodal Understanding Pro (MMMU-Pro)

A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.

According to BenchLM.ai, GPT-5.2 Pro leads the MMMU-Pro benchmark with a score of 96, followed by GPT-5.4 (95) and GPT-5.2 (95). The top models are clustered within one point of each other, suggesting this benchmark is nearing saturation for frontier models.

121 models have been evaluated on MMMU-Pro. The benchmark falls in BenchLM.ai's multimodal grounding category, which carries a 15% weight in the site's overall scoring system, so strong performance here directly affects a model's overall ranking.
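
To make the weighting concrete, here is a minimal sketch of how a category-weighted overall score could be computed. Only the 15% multimodal weight comes from the text above; the other category names, their weights, and the linear-average aggregation are illustrative assumptions, not BenchLM.ai's documented formula.

```python
# Minimal sketch of a category-weighted overall score.
# ASSUMPTIONS: only the 0.15 multimodal weight is stated in the text;
# the other categories, their weights, and the linear aggregation are
# hypothetical, not BenchLM.ai's actual method.

CATEGORY_WEIGHTS = {
    "multimodal": 0.15,  # stated weight for MMMU-Pro's category
    "reasoning": 0.45,   # hypothetical
    "coding": 0.40,      # hypothetical
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores on a 0-100 scale."""
    assert abs(sum(CATEGORY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * category_scores[cat] for cat, w in CATEGORY_WEIGHTS.items())

# Under this scheme, a 96 on MMMU-Pro contributes 0.15 * 96 = 14.4
# points to the overall score.
print(overall_score({"multimodal": 96, "reasoning": 90, "coding": 88}))
```

The takeaway is linear in the weight: under a weighted average like this, every extra MMMU-Pro point moves the overall score by 0.15 points.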

About MMMU-Pro

Year: 2025
Tasks: Multimodal academic reasoning
Format: Image + text question answering
Difficulty: Frontier multimodal

MMMU-Pro extends the original MMMU setup with more difficult multimodal questions, giving stronger separation among models at the top end of the field.
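
The leaderboard scores below read as percentage accuracies on image-plus-text multiple-choice questions. As a rough illustration of that format, here is a minimal evaluation-loop sketch; the `Item` structure and the `model.answer(...)` interface are hypothetical stand-ins, not MMMU-Pro's actual harness.

```python
from dataclasses import dataclass

# Sketch of a multiple-choice multimodal eval loop. ASSUMPTION: items
# are scored by exact-match accuracy on the chosen option; the Item
# fields and the model.answer() interface are hypothetical.

@dataclass
class Item:
    image_path: str     # the question's figure, chart, or diagram
    question: str       # question text
    options: list[str]  # candidate answers, labeled "A", "B", ...
    answer: str         # gold option label

def accuracy(model, items: list[Item]) -> float:
    """Percentage of items where the model selects the gold option."""
    correct = sum(
        model.answer(it.image_path, it.question, it.options) == it.answer
        for it in items
    )
    return 100.0 * correct / len(items)
```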

MMMU-Pro Leaderboard (121 models)

Rank  Score  Model
#1    96     GPT-5.2 Pro
#2    95     GPT-5.4
#3    95     GPT-5.2
#4    95     GPT-5.3 Instant
#5    95     Claude Opus 4.6
#6    95     Gemini 3.1 Pro
#7    95     Grok 4.1
#9    95     Claude Sonnet 4.6
#10   95     Claude Sonnet 4.5
#11   94     GPT-5.4 Pro
#12   94     GPT-5.2 Instant
#13   94     GPT-5.1
#14   94     Claude Opus 4.5
#15   94     Gemini 3 Pro
#16   93     GPT-5 (high)
#18   89     GPT-5.3 Codex
#19   89     GPT-5 (medium)
#21   86     GPT-5 mini
#22   86     Gemini 2.5 Pro
#24   84     GPT-5.2-Codex
#25   82     Claude 4.1 Opus
#26   82     Claude Haiku 4.5
#27   81     Claude 4 Sonnet
#28   80     Grok 4
#29   80     Seed 1.6
#30   80     Seed-2.0-Lite
#31   80     Gemini 3 Flash
#32   78     MiMo-V2-Flash
#33   77     Claude 3.5 Sonnet
#34   75     Mistral Large 3
#35   75     Gemini 1.5 Pro
#36   74     GLM-5 (Reasoning)
#37   74     GPT-4o
#38   74     Seed 1.6 Flash
#40   74     Seed-2.0-Mini
#41   73     o3-mini
#42   73     Claude 3 Opus
#43   73     Gemini 1.0 Pro
#44   72     o1-preview
#45   72     Kimi K2.5 (Reasoning)
#47   70     o3-pro
#48   70     o3
#49   70     GPT-4.1
#50   70     Ministral 3 14B
#51   70     Claude 3 Haiku
#52   69     Gemini 2.5 Flash
#53   68     o1
#55   66     GLM-4.7
#56   66     GLM-5
#57   66     Mercury 2
#58   66     o4-mini (high)
#59   66     GPT-4.1 mini
#60   66     GPT-4o mini
#62   64     DeepSeekMath V2
#63   64     Step 3.5 Flash
#64   64     Qwen2.5-72B
#65   63     Qwen2.5-1M
#67   61     DeepSeek V3.2
#69   61     Kimi K2.5
#70   61     Aion-2.0
#71   60     DeepSeek LLM 2.0
#73   60     Llama 4 Scout
#76   58     GLM-4.7-Flash
#77   58     GPT-5 nano
#78   58     Qwen2.5-VL-32B
#79   57     MiniMax M2.5
#80   56     Qwen3.5 397B
#81   56     Mistral Large 2
#84   54     Phi-4
#85   53     GPT-4.1 nano
#86   53     GPT-4 Turbo
#87   50     DeepSeek Coder 2.0
#88   50     Llama 3 70B
#89   49     Moonshot v1
#90   48     o1-pro
#91   46     Nemotron-4 15B
#92   46     Z-1
#93   43     DeepSeek-R1
#94   42     GPT-OSS 120B
#95   42     Mistral 8x7B
#98   39     Gemma 3 27B
#99   39     LFM2-24B-A2B
#102  38     Qwen3 235B 2507
#105  37     Nova Pro
#106  36     DBRX Instruct
#107  36     GLM-4.5
#108  36     GLM-4.5-Air
#109  35     DeepSeek V3.1
#111  35     Kimi K2
#112  34     MiniMax M1 80k
#113  31     GPT-OSS 20B
#115  27     LFM2.5-1.2B-Thinking
#116  27     Ministral 3 8B
#117  27     Mistral 7B v0.3
#118  27     LFM2.5-1.2B-Instruct
#119  26     Mistral 8x7B v0.2
#121  25     Ministral 3 3B

FAQ

What does MMMU-Pro measure?

MMMU-Pro measures multimodal understanding and reasoning: models must combine text with images, diagrams, charts, and other academic visual material. It is a harder successor to the original MMMU, aimed at frontier models.

Which model scores highest on MMMU-Pro?

GPT-5.2 Pro by OpenAI currently leads with a score of 96 on MMMU-Pro.

How many models are evaluated on MMMU-Pro?

121 AI models have been evaluated on MMMU-Pro on BenchLM.ai.

Last updated: March 12, 2026
