Multimodal and grounded benchmarks test whether a model can reason over visual content — images, charts, documents, screenshots, and spreadsheets — not just process plain text. This category carries a 12% weight in BenchLM.ai's overall score. MMMU-Pro tests frontier-difficulty visual reasoning, while OfficeQA Pro focuses on enterprise document workflows. For products where users upload images, share PDFs, or need models to read dashboards and data tables, scores here are a better predictor of real performance than chat-only benchmarks. Most top proprietary models are competitive; open-weight models show wider spread.
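To make the 12% weighting concrete, here is a minimal sketch of how a category weight could feed into a weighted overall score. The category names and the 80-point placeholder are illustrative assumptions, not BenchLM.ai's actual methodology; only the 12% multimodal/grounded weight comes from the text above.

```python
# Hypothetical weighted-average sketch; names and values other than
# the 12% weight are assumptions for illustration.
category_scores = {
    "multimodal_grounded": 95.0,   # example category average
    "other_categories": 80.0,      # placeholder for the remaining 88%
}
weights = {
    "multimodal_grounded": 0.12,   # weight stated in the text
    "other_categories": 0.88,
}

# Overall score = sum of (category score x category weight)
overall = sum(category_scores[c] * weights[c] for c in weights)
print(round(overall, 2))  # 95*0.12 + 80*0.88 = 11.4 + 70.4 = 81.8
```

Because the multimodal category is only 12% of the total, even a large lead there moves the overall score by just a few points, which is why category-level rankings can diverge from the overall leaderboard.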
According to BenchLM.ai, GPT-5.2 Pro leads this ranking with a score of 96, followed by GPT-5.4 (95.5) and GPT-5.2 (95). The top three are separated by just one point — any of them would perform well for this use case.
The best open-weight option is GLM-5 (Reasoning), ranked #30 with a score of 78.5. Proprietary models hold a clear advantage in this category, though open-weight options may suffice for less demanding use cases.
This ranking is based on average scores across all multimodal and grounded benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.