Multimodal & Grounded Benchmarks

Vision, document, and grounded enterprise workflow benchmarks

MMMU-Pro · OfficeQA Pro

Multimodal and grounded benchmarks test whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, office files, and mixed visual-text artifacts.

BenchLM.ai currently tracks MMMU-Pro for frontier multimodal reasoning and OfficeQA Pro for grounded enterprise-style document tasks.

This category carries a 12% weight in the overall score. Because it currently contains only two benchmarks, the weight has been reduced from 15% until more multimodal evaluations become available. The category remains important for enterprise copilots and document-heavy workflows.
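To make the effect of the weight change concrete, here is a minimal sketch of how a category weight could feed into an overall score. It assumes the overall score is a plain weighted sum of 0-100 category scores, which the page does not confirm; the function name and the example score of 91 are illustrative only.

```python
# Hypothetical sketch: assumes the overall score is a weighted sum of
# category scores on a 0-100 scale. The real aggregation may differ.

CATEGORY_WEIGHT = 0.12   # current weight stated in the text
PREVIOUS_WEIGHT = 0.15   # weight before the reduction


def category_contribution(category_score: float, weight: float) -> float:
    """Points a category would add to the overall score under a weighted sum."""
    return category_score * weight


example_score = 91  # illustrative category score

current = category_contribution(example_score, CATEGORY_WEIGHT)
previous = category_contribution(example_score, PREVIOUS_WEIGHT)

print(f"At 12% weight: {current:.2f} points")
print(f"At 15% weight: {previous:.2f} points")
print(f"Difference:    {previous - current:.2f} points")
```

Under this assumption, lowering the weight from 15% to 12% reduces the most this category can contribute to the overall score from 15 points to 12.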

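123 models (showing the top 25)

| Rank | Model | Creator | Source | Type | Context | Score | MMMU-Pro | OfficeQA Pro |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 94% | 96% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 96% | 96% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 96% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 89% | 94% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 95% | 95% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 95% | 95% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 86% | 91% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 95% | 94% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 94% | 92% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 84% | 92% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 95% | 95% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 85% | 92% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 95% | 91% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 95% | 95% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 94% | 89% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 93% | 85% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 95% | 88% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 74% | 84% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 89% | 87% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 94% | 87% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 94% | 92% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 72% | 80% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 95% | 87% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 91% | 83% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 72% | 77% |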

About Multimodal & Grounded Benchmarks

Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems