Multimodal & Grounded Benchmarks — MMMU-Pro & OfficeQA Leaderboard
Vision, document, and grounded enterprise workflow benchmarks
Bottom line: Multimodal is one of the fastest-evolving categories. Models that can read screenshots, charts, and documents are essential for enterprise copilots.
MMMU-Pro · OfficeQA Pro · MMMU-Pro w/ Python · OmniDocBench 1.5 · GDPval-AA · MedXpertQA (MM) · ZeroBench · Design2Code · Flame-VLM-Code · Vision2Web · ImageMining · MMSearch · MMSearch-Plus · SimpleVQA · Facts-VLM · V*
Best Multimodal & Grounded picks
BenchLM summaries for multimodal & grounded models, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
Top AI Models for Multimodal & Grounded — April 2026
As of April 2026, GPT-5.4 Pro leads the provisional multimodal & grounded leaderboard, tied with Gemini 3 Pro Deep Think at a weighted score of 100.0% and followed by Claude Mythos Preview (97.8%). BenchLM currently shows 109 provisional-ranked models and 18 verified-ranked models in this category.
- GPT-5.4 Pro (OpenAI)
- Gemini 3 Pro Deep Think (Google)
- Claude Mythos Preview (Anthropic) — best multimodal reasoning; top MMMU-Pro for academic and scientific visuals.
What changed
GPT-5.4 Pro leads the weighted score, backed by the strongest OfficeQA Pro result (94%).
Claude Mythos Preview posts the top MMMU-Pro score (99) but sits at #3 on the weighted blend.
Gemini 3 Pro Deep Think ties GPT-5.4 Pro at 100% on an estimated MMMU-Pro score.
Top models by benchmark
MMMU-Pro — frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems (55% of category score)
Multimodal & Grounded Leaderboard
Updated April 10, 2026. Sorted by multimodal & grounded weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
1 GPT-5.4 Pro OpenAI | 100% | 92 | 94% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
2 Gemini 3 Pro Deep Think Google | 100% | Est.87 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
3 Claude Mythos Preview Anthropic | 97.8% | 99 | 92.7% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
4 Grok 4.1 xAI | 97.5% | Est.81 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
5 GPT-5.1 OpenAI | 95.8% | Est.81 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
6 GPT-5.3 Codex OpenAI | 95.3% | Est.89 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
7 Claude Sonnet 4.6 Anthropic | 95% | 86 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
8 Claude Sonnet 4.5 Anthropic | 94.2% | Est.68 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
9 GPT-5 (high) OpenAI | 91.6% | Est.80 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
10 Gemini 3.1 Pro Google | 90.4% | 94 | 83.9% | — | — | — | 1320 | 81.3% | 29.0% | — | — | — | — | — | — | 72.4% | — | — |
11 GPT-5.1-Codex-Max OpenAI | 89.8% | Est.79 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
12 GPT-5 (medium) OpenAI | 89.4% | Est.74 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
13 GPT-5.2-Codex OpenAI | 88.9% | Est.80 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
14 | 88% | Est.72 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
15 GPT-5.4 OpenAI | 87.9% | 94 | 81.2% | — | — | — | 1672 | 77.1% | 41.0% | — | — | — | — | — | — | 61.1% | — | — |
16 GPT-5.2 OpenAI | 86.3% | Est.84 | 79.5% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
17 Gemini 3 Pro Google | 86% | Est.83 | 81% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
18 Claude Opus 4.6 Anthropic | 84.2% | 92 | 77.3% | — | — | — | 1606 | 64.8% | — | — | — | — | — | — | — | — | — | — |
19 Gemini 2.5 Pro Google | 84.1% | Est.67 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
20 Muse Spark Meta | 76.6% | 80 | 80.4% | — | — | — | 1444 | 78.4% | 33.0% | — | — | — | — | — | — | 71.3% | — | — |
21 Claude 4.1 Opus Anthropic | 76.1% | Est.53 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
22 Claude 4 Sonnet Anthropic | 74.3% | Est.52 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
23 Gemini 3 Flash Google | 74.2% | Est.67 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
24 Qwen3.6 Plus Alibaba | 73.8% | 77 | 78.8% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
25 | 72.7% | Est.85 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
These rankings update weekly
Score in Context
What these scores mean
Multimodal & Grounded carries a 12% weight in overall scoring. The weighted score blends MMMU-Pro (academic multimodal reasoning) and OfficeQA Pro (enterprise document understanding). A model can know facts in text and still fail when the information is in a chart, screenshot, or spreadsheet — this category measures that gap.
Known limitations
Not all models support image input — text-only models are excluded from this category entirely. OfficeQA Pro is relatively new and coverage is still building. Enterprise-specific document formats (scanned PDFs, handwritten notes) remain under-tested by all benchmarks.
How we weight
The 12% weight reflects how central this category is to enterprise copilots and document-heavy workflows, where models must interpret visuals, screenshots, and scanned artifacts.
This category tests whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, and mixed visual-text artifacts. See the multimodal leaderboard or compare with knowledge benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
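That filter-then-fall-back behavior could look roughly like this. The row schema and the `derived`/`cloned` flags are hypothetical, chosen only to illustrate the rule described above:

```python
def trusted_rows(rows: list[dict]) -> list[dict]:
    """Drop benchmark rows generated from other scores ("derived") or
    cloned from reference models, keeping only sourced public results."""
    return [r for r in rows if not (r.get("derived") or r.get("cloned"))]

rows = [
    {"benchmark": "MMMU-Pro", "score": 92.0},
    {"benchmark": "OfficeQA Pro", "score": 94.0, "derived": True},  # excluded
]
kept = trusted_rows(rows)
# Only the MMMU-Pro row survives; the category score then falls back to
# the remaining trustworthy rows instead of filling the gap synthetically.
```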
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MMMU-Pro | 55% | Weighted | Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems |
| OfficeQA Pro | 45% | Weighted | Grounded office and enterprise document benchmark |
| MMMU-Pro w/ Python | — | Display only | Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning |
| OmniDocBench 1.5 | — | Display only | Document understanding benchmark measured by edit distance on complex document extraction tasks |
| GDPval-AA | — | Display only | An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work. |
| MedXpertQA (MM) | — | Display only | A clinically grounded multimodal medical multiple-choice benchmark with image inputs. |
| ZeroBench | — | Display only | A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use. |
| Design2Code | — | Display only | Multimodal coding benchmark for turning visual designs into working frontend implementations. |
| Flame-VLM-Code | — | Display only | Vision-language coding benchmark for generating correct code from visual and multimodal inputs. |
| Vision2Web | — | Display only | Benchmark for converting visual references into functional web implementations. |
| ImageMining | — | Display only | Multimodal retrieval and extraction benchmark over image-heavy task settings. |
| MMSearch | — | Display only | Multimodal search benchmark for retrieval and grounded answering across mixed-media inputs. |
| MMSearch-Plus | — | Display only | A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows. |
| SimpleVQA | — | Display only | Visual question answering benchmark focused on straightforward image-grounded understanding. |
| Facts-VLM | — | Display only | Grounded multimodal factuality benchmark for evidence-linked answer correctness. |
| V* | — | Display only | Vision-centric benchmark for high-level multimodal reasoning and perception quality. |
About Multimodal & Grounded Benchmarks
This category measures whether models can read the world as it appears in real products — charts, diagrams, screenshots, scanned documents, and mixed visual-text artifacts — by blending MMMU-Pro and OfficeQA Pro into a single weighted score.