Multimodal & Grounded Benchmarks
Vision, document, and grounded enterprise workflow benchmarks
MMMU-Pro · OfficeQA Pro
Multimodal and grounded benchmarks test whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, office files, and mixed visual-text artifacts.
BenchLM.ai currently tracks MMMU-Pro for frontier multimodal reasoning and OfficeQA Pro for grounded enterprise-style document tasks.
This category carries a 12% weight in the overall score. Because it currently contains only two benchmarks, the weight was reduced from 15% and will stay there until more multimodal evaluations become available. The category still matters for enterprise copilots and document-heavy workflows.
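To make the effect of the weight change concrete, here is a minimal sketch of how a category weight could feed into an overall score. It assumes the overall score is a plain weighted average of category scores, which this page does not confirm. The function `overall_score`, the lumped `everything_else` category, and the example scores of 90 and 80 are all hypothetical; only the 12% and 15% weights come from the text above.

```python
def overall_score(category_scores, weights):
    """Weighted average of category scores (assumed aggregation, not confirmed by the source)."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return sum(category_scores[name] * w for name, w in weights.items())


# Hypothetical scores: all non-multimodal categories are lumped into one bucket.
scores = {"multimodal_grounded": 90.0, "everything_else": 80.0}

# Current weighting from the text: 12% for this category.
current = overall_score(scores, {"multimodal_grounded": 0.12, "everything_else": 0.88})

# Earlier weighting from the text: 15% for this category.
previous = overall_score(scores, {"multimodal_grounded": 0.15, "everything_else": 0.85})

print(f"overall at 12% weight: {current:.1f}")
print(f"overall at 15% weight: {previous:.1f}")
```

Under these assumed numbers, lowering the weight from 15% to 12% moves the overall score by only a few tenths of a point, so a model that is strong on multimodal tasks but weaker elsewhere gains slightly less from this category than it would have before.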
1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 94% | 96% |
2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 96% | 96% |
3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 96% |
4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 89% | 94% |
5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 95% | 95% |
6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 95% | 95% |
7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 86% | 91% |
8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 95% | 94% |
9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 94% | 92% |
10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 84% | 92% |
11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 95% | 95% |
12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 85% | 92% |
13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 95% | 91% |
14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 95% | 95% |
15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 94% | 89% |
16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 93% | 85% |
17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 95% | 88% |
18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 74% | 84% |
19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 89% | 87% |
20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 94% | 87% |
21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 94% | 92% |
22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 72% | 80% |
23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 95% | 87% |
24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 91% | 83% |
25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 72% | 77% |
About Multimodal & Grounded Benchmarks
MMMU-Pro: frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems.