Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.
As of March 2026, GPT-5.4 leads the MMMU-Pro w/ Python leaderboard with 81.5%, followed by GPT-5.2 (80.4%) and GPT-5.4 mini (78%).
1. GPT-5.4 — OpenAI
2. GPT-5.2 — OpenAI
3. GPT-5.4 mini — OpenAI
According to BenchLM.ai, GPT-5.4 leads the MMMU-Pro w/ Python benchmark with a score of 81.5%, followed by GPT-5.2 (80.4%) and GPT-5.4 mini (78%). The top three scores span 3.5 percentage points, a moderate spread with a meaningful gap between the top-tier and mid-tier models.
Five models have been evaluated on MMMU-Pro w/ Python. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM.ai's overall scoring system. MMMU-Pro w/ Python itself, however, is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
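The weighting scheme above can be sketched in a few lines of Python. This is a hypothetical illustration, not BenchLM.ai's actual formula: it assumes scores are averaged within each category, category averages are combined by weight (e.g. 12% for Multimodal & Grounded), and benchmarks flagged as reference-only (like MMMU-Pro w/ Python) are skipped. All benchmark entries and the 88% weight for the second category are made up for the example.

```python
# Hypothetical sketch of a category-weighted overall score.
# Assumption: per-category mean, then a weighted sum of category means,
# with reference-only benchmarks excluded from the average.

def overall_score(benchmarks, category_weights):
    """benchmarks: list of dicts with 'category', 'score', 'excluded' keys."""
    by_cat = {}
    for b in benchmarks:
        if b.get("excluded"):
            continue  # displayed for reference only, not scored
        by_cat.setdefault(b["category"], []).append(b["score"])
    total = 0.0
    for cat, weight in category_weights.items():
        scores = by_cat.get(cat)
        if scores:
            total += weight * (sum(scores) / len(scores))
    return total

# Illustrative data: only the 12% category weight comes from the page.
benchmarks = [
    {"category": "Multimodal & Grounded", "score": 81.5, "excluded": True},   # MMMU-Pro w/ Python
    {"category": "Multimodal & Grounded", "score": 74.0, "excluded": False},  # made-up benchmark
    {"category": "Reasoning", "score": 90.0, "excluded": False},              # made-up benchmark
]
weights = {"Multimodal & Grounded": 0.12, "Reasoning": 0.88}  # illustrative split

print(round(overall_score(benchmarks, weights), 2))  # excluded score has no effect
```

Note that removing the MMMU-Pro w/ Python entry entirely leaves the result unchanged, which is the practical meaning of "excluded from the scoring formula."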
Year: 2026
Tasks: Multimodal academic reasoning
Format: Image + text question answering with Python
Difficulty: Frontier multimodal
Useful for measuring multimodal reasoning when the model can combine visual understanding with computation.