MMMU-Pro with Python (MMMU-Pro w/ Python)

Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.

Top Models on MMMU-Pro w/ Python — March 2026

As of March 2026, GPT-5.4 leads the MMMU-Pro w/ Python leaderboard with 81.5%, followed by GPT-5.2 (80.4%) and GPT-5.4 mini (78.0%).

5 models · Multimodal & Grounded · Updated March 17, 2026

According to BenchLM.ai, GPT-5.4 leads the MMMU-Pro w/ Python benchmark with a score of 81.5%, followed by GPT-5.2 (80.4%) and GPT-5.4 mini (78.0%). The scores show moderate spread: 12 points separate the leader from fifth-place GPT-5.4 nano (69.5%), with a clear gap between the top two models and the mini/nano tier.

5 models have been evaluated on MMMU-Pro w/ Python. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM.ai's overall scoring system. MMMU-Pro w/ Python is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
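To make the weighting concrete, here is a minimal sketch of how a category-weighted overall score with reference-only exclusions could work. The 12% category weight and the exclusion of MMMU-Pro w/ Python come from this page; everything else (function names, the other benchmark and its score) is illustrative, since BenchLM.ai's actual formula is not published here.

```python
def overall_score(category_scores, category_weights):
    """Weighted average of per-category scores (weights sum to 1.0)."""
    return sum(category_scores[c] * w for c, w in category_weights.items())

def category_average(benchmark_scores, excluded=()):
    """Average a category's benchmark scores, skipping reference-only ones."""
    included = {b: s for b, s in benchmark_scores.items() if b not in excluded}
    return sum(included.values()) / len(included)

# "MMMU-Pro w/ Python" is displayed for reference but excluded from scoring.
multimodal = {
    "MMMU-Pro": 74.2,            # hypothetical score for illustration
    "MMMU-Pro w/ Python": 81.5,  # shown on the page, excluded from the formula
}
avg = category_average(multimodal, excluded={"MMMU-Pro w/ Python"})
print(avg)  # only the included benchmark counts toward the category
```

Under this reading, a model's MMMU-Pro w/ Python score can move within the leaderboard without shifting the Multimodal & Grounded category average at all.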

About MMMU-Pro w/ Python

Year: 2026
Tasks: Multimodal academic reasoning
Format: Image + text question answering with Python
Difficulty: Frontier multimodal

Useful for measuring multimodal reasoning when the model can combine visual understanding with computation.
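The page does not specify how the Python assistance is wired up, but a tool-augmented harness typically lets the model propose code, executes it, and feeds the output back before the final answer. A hypothetical sketch of that execution step (the function name is ours; a real harness would sandbox untrusted code rather than call `exec` directly):

```python
import contextlib
import io

def run_python_tool(code: str) -> str:
    """Execute model-proposed Python and return captured stdout.

    Hypothetical harness step for illustration only; a real evaluation
    would run the code in an isolated sandbox, not in-process exec().
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # empty globals: no access to the harness itself
    return buf.getvalue().strip()

# e.g. the model offloads arithmetic it read off a chart:
result = run_python_tool("print(2 ** 10)")  # -> "1024"
```

The point of the benchmark is exactly this division of labor: visual understanding stays with the model, while exact computation is delegated to the tool.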


Leaderboard (5 models)

#1 GPT-5.4: 81.5%
#2 GPT-5.2: 80.4%
#3 GPT-5.4 mini: 78.0%
#4 GPT-5 mini: 74.1%
#5 GPT-5.4 nano: 69.5%

FAQ

What does MMMU-Pro w/ Python measure?

MMMU-Pro w/ Python is a tool-augmented variant of MMMU-Pro that allows the model to use Python during multimodal reasoning.

Which model scores highest on MMMU-Pro w/ Python?

GPT-5.4 by OpenAI currently leads with a score of 81.5% on MMMU-Pro w/ Python.

How many models are evaluated on MMMU-Pro w/ Python?

5 AI models have been evaluated on MMMU-Pro w/ Python on BenchLM.

Last updated: March 17, 2026
