A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.
As of April 29, 2026, GPT-5.4 Pro leads the MMMU-Pro leaderboard with 94%, followed by Claude Mythos Preview (92.7%) and Gemini 3.1 Pro (83.9%).
1. GPT-5.4 Pro (OpenAI): 94%
2. Claude Mythos Preview (Anthropic): 92.7%
3. Gemini 3.1 Pro (Google): 83.9%
According to BenchLM.ai, GPT-5.4 Pro leads the MMMU-Pro benchmark with a score of 94%, followed by Claude Mythos Preview (92.7%) and Gemini 3.1 Pro (83.9%). The top two models are separated by only 1.3 points, while Gemini 3.1 Pro trails the leader by roughly 10 points, a meaningful gap between the top tier and the mid-tier models.
23 models have been evaluated on MMMU-Pro. The benchmark falls in the Multimodal & Grounded category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, MMMU-Pro contributes 45% of the category score, so strong performance here directly affects a model's overall ranking.
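To make that weighting concrete, the sketch below shows how a two-level weighting like this could roll up into an overall score. The 12% category weight and 45% within-category weight come from this page; the simple multiplicative aggregation is an assumption for illustration, not BenchLM's published formula.

```python
# Sketch of how a two-level benchmark weighting could roll up into an
# overall score. The 12% category weight and 45% within-category weight
# come from this page; the multiplicative aggregation is an assumption,
# not BenchLM's published formula.

CATEGORY_WEIGHT = 0.12   # Multimodal & Grounded share of the overall score
BENCHMARK_WEIGHT = 0.45  # MMMU-Pro share within that category


def effective_weight(category_weight: float, benchmark_weight: float) -> float:
    """Fraction of the overall score driven by a single benchmark."""
    return category_weight * benchmark_weight


def overall_delta(mmmu_pro_gain: float) -> float:
    """Change in overall score from a gain on MMMU-Pro, in points."""
    return mmmu_pro_gain * effective_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT)


if __name__ == "__main__":
    # 0.12 * 0.45 = 0.054, i.e. MMMU-Pro alone drives about 5.4% of the overall score.
    print(f"effective weight: {effective_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT):.3f}")
    # Under this assumed scheme, a 5-point MMMU-Pro improvement moves the
    # overall score by about 0.27 points.
    print(f"overall delta from +5 MMMU-Pro points: {overall_delta(5):.2f}")
```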
Year: 2024
Tasks: Multimodal academic reasoning
Format: Image + text question answering
Difficulty: Frontier multimodal
MMMU-Pro extends the original MMMU setup with more difficult multimodal questions and stronger separation at the top end of the model market.
Version: MMMU-Pro 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
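As a rough illustration of what such a decision could look like, the sketch below maps freshness metadata to the three tiers named above. The tier names come from this page; the field names, thresholds, and rules are illustrative assumptions rather than BenchLM's actual policy, which is documented on the methodology page.

```python
# Illustrative sketch of mapping freshness metadata to a scoring tier.
# The three tier names appear on this page; the fields, thresholds, and
# rules below are assumptions for illustration, not BenchLM's real policy.

from dataclasses import dataclass


@dataclass
class BenchmarkFreshness:
    refresh_cadence_months: int   # e.g. 12 for an annual refresh
    months_since_refresh: int
    questions_public: bool        # public sets are more exposed to contamination


def scoring_tier(meta: BenchmarkFreshness) -> str:
    """Map freshness metadata to a scoring tier (assumed rules)."""
    overdue_by = meta.months_since_refresh - meta.refresh_cadence_months
    if overdue_by <= 0:
        return "strong differentiator"
    if meta.questions_public and overdue_by > meta.refresh_cadence_months:
        # More than a full cadence overdue with public questions:
        # treat as a display-only reference.
        return "display-only reference"
    return "benchmark to watch"


# Example using MMMU-Pro 2024's metadata from this page (annual cadence,
# public set), with months_since_refresh assumed for illustration.
print(scoring_tier(BenchmarkFreshness(refresh_cadence_months=12,
                                      months_since_refresh=8,
                                      questions_public=True)))
# -> "strong differentiator"
```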