Multimodal & Grounded Benchmarks — MMMU-Pro, OfficeQA & CharXiv Leaderboard
Vision, document, and grounded enterprise workflow benchmarks
Bottom line: Multimodal is one of the fastest-evolving categories. Models that can read screenshots, charts, and documents are essential for enterprise copilots.
MMMU-Pro · OfficeQA Pro · MMMU-Pro w/ Python · OmniDocBench 1.5 · GDPval-AA · MedXpertQA (MM) · ZeroBench · Design2Code · Flame-VLM-Code · Vision2Web · ImageMining · MMSearch · MMSearch-Plus · SimpleVQA · Facts-VLM · V*
Best Multimodal & Grounded picks
BenchLM summaries for multimodal & grounded, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
| Model | Vendor | Metric | Value |
|---|---|---|---|
| Gemini 3 Pro Deep Think | Google | Category score | 100 |
| DeepSeek V4 Pro (Max) | DeepSeek | Overall score | 87 |
| Qwen3.6-27B | Alibaba | Avg price / 1M tokens | $0.00 |
| Mercury 2 | Inception | Tokens / sec | 789 |
| LFM2-24B-A2B | LiquidAI | TTFT | 0.42s |
| Nemotron 3 Ultra 500B | NVIDIA | Context window | 10M |
Top AI Models for Multimodal & Grounded — April 2026
As of April 2026, Gemini 3 Pro Deep Think leads the provisional multimodal & grounded leaderboard with a weighted score of 100.0%, followed by Grok 4.1 (97.8%) and Claude Mythos Preview (97.6%). BenchLM is currently showing 105 provisional-ranked models and 16 verified-ranked models in this category.
1. Gemini 3 Pro Deep Think (Google)
2. Grok 4.1 (xAI)
3. Claude Mythos Preview (Anthropic)
Best multimodal reasoning. Top MMMU-Pro for academic and scientific visuals.
What changed
Claude Mythos Preview posts the strongest MMMU-Pro score in the category.
GPT-5.4 is close behind, with strong OfficeQA Pro and MMMU-Pro results.
Claude Opus 4.7 adds official CharXiv visual reasoning coverage.
How to choose
Top models by benchmark
MMMU-Pro: frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems (45% of category score)
Multimodal & Grounded Leaderboard
Updated April 24, 2026. Sorted by multimodal & grounded weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Vendor | Weighted score | Overall score | MMMU-Pro | OfficeQA Pro | MMMU-Pro w/ Python | OmniDocBench 1.5 | GDPval-AA | MedXpertQA (MM) | ZeroBench | Design2Code | Flame-VLM-Code | Vision2Web | ImageMining | MMSearch | MMSearch-Plus | SimpleVQA | Facts-VLM | V* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro Deep Think | Google | 100% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 2 | Grok 4.1 | xAI | 97.8% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 3 | Claude Mythos Preview | Anthropic | 97.6% | 99 | 92.7% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 4 | GPT-5.1 | OpenAI | 96.3% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 5 | GPT-5.3 Codex | OpenAI | 94.8% | Est. 88 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 6 | Claude Sonnet 4.5 | Anthropic | 94.8% | Est. 66 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 7 | GPT-5 (high) | OpenAI | 92.3% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 8 | GPT-5 (medium) | OpenAI | 89.6% | Est. 72 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 9 | GPT-5.1-Codex-Max | OpenAI | 89.2% | Est. 77 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 10 | — | — | 88.7% | Est. 70 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 11 | GPT-5.2-Codex | OpenAI | 88.2% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 12 | Gemini 2.5 Pro | Google | 84.3% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 13 | Claude Sonnet 4.6 | Anthropic | 83.8% | 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 14 | Gemini 3.1 Pro | Google | 81.6% | 92 | 83.9% | — | — | — | 1320 | 81.3% | 29.0% | — | — | — | — | — | — | 72.4% | — | — |
| 15 | GPT-5.2 | OpenAI | 79.8% | 81 | 79.5% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 75.9% |
| 16 | Gemini 3 Pro | Google | 79.2% | 81 | 81% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 88.0% |
| 17 | Muse Spark | Meta | 77.5% | 82 | 80.4% | — | — | — | 1444 | 78.4% | 33.0% | — | — | — | — | — | — | 71.3% | — | — |
| 18 | Claude 4.1 Opus | Anthropic | 76.5% | Est. 52 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 19 | Claude 4 Sonnet | Anthropic | 74.7% | Est. 51 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 20 | Claude Opus 4.6 | Anthropic | 73.6% | 87 | 77.3% | — | — | — | 1606 | 64.8% | — | — | — | — | — | — | — | — | — | — |
| 21 | Claude Haiku 4.5 | Anthropic | 72.8% | Est. 58 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 22 | Grok 4 | xAI | 72.2% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 23 | — | — | 71.9% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 24 | Gemini 3 Flash | Google | 69.8% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 25 | Qwen3.6 Plus | Alibaba | 68.9% | 74 | 78.8% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 96.9% |
Score in Context
What these scores mean
Multimodal & Grounded carries a 12% weight in overall scoring. The weighted score blends MMMU-Pro (academic multimodal reasoning), OfficeQA Pro (enterprise document understanding), and CharXiv (visual chart and figure reasoning). A model can know facts in text and still fail when the information is in a chart, screenshot, or spreadsheet — this category measures that gap.
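As a rough illustration of that blend, here is a minimal sketch in Python. It is not BenchLM's actual scoring code: the 45% and 30% weights below are the published MMMU-Pro and OfficeQA Pro weights, the CharXiv share is assumed to be the remaining 25%, and the model scores are hypothetical.

```python
# Sketch of the category blend described above; illustrative only.
WEIGHTS = {
    "MMMU-Pro": 0.45,      # published weight
    "OfficeQA Pro": 0.30,  # published weight
    "CharXiv": 0.25,       # assumption: the unpublished remainder of the blend
}

def blend(scores: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())

# Hypothetical scores, purely for illustration.
category = blend({"MMMU-Pro": 85.0, "OfficeQA Pro": 80.0, "CharXiv": 78.0})
print(round(category, 2))         # 81.75 -> category score
print(round(category * 0.12, 2))  # 9.81  -> points contributed to the overall score (12% weight)
```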
Known limitations
Not all models support image input — text-only models are excluded from this category entirely. OfficeQA Pro and CharXiv coverage is still building, so rankings should be read as a blend of available public evidence rather than a complete visual capability profile. Enterprise-specific document formats (scanned PDFs, handwritten notes) remain under-tested by all benchmarks.
How we weight
Multimodal & Grounded carries a 12% weight in BenchLM.ai's overall scoring. It remains important for enterprise copilots and document-heavy workflows where models need to interpret visuals, screenshots, and scanned artifacts.
This category tests whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, and mixed visual-text artifacts. See the multimodal leaderboard or compare with knowledge benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
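To make the filter-and-fallback behaviour concrete, here is a minimal sketch under the assumption that each benchmark row carries a provenance tag. The field names (`benchmark`, `score`, `source`), the `UNTRUSTED_SOURCES` values, and the example weights are illustrative, not BenchLM's actual schema.

```python
# Sketch of the trust filter and fallback described above; names are assumptions.
UNTRUSTED_SOURCES = {"generated", "cloned"}  # rows derived from other scores

def trusted_scores(rows: list[dict]) -> dict[str, float]:
    """Keep only benchmark rows backed by public evidence."""
    return {row["benchmark"]: row["score"]
            for row in rows if row["source"] not in UNTRUSTED_SOURCES}

def category_score(rows: list[dict], weights: dict[str, float]) -> float | None:
    """Blend trusted rows; when a weighted benchmark is missing, renormalize
    over the remaining trusted benchmarks instead of imputing a value."""
    scores = trusted_scores(rows)
    usable = {b: w for b, w in weights.items() if b in scores}
    if not usable:
        return None  # no trustworthy public rows, so no category score
    total = sum(usable.values())
    return sum(scores[b] * w for b, w in usable.items()) / total

# Example: the OfficeQA Pro row is excluded as cloned, so the category score
# falls back to the remaining trusted benchmark (MMMU-Pro) alone.
rows = [
    {"benchmark": "MMMU-Pro", "score": 85.0, "source": "public"},
    {"benchmark": "OfficeQA Pro", "score": 90.0, "source": "cloned"},
]
print(category_score(rows, {"MMMU-Pro": 0.45, "OfficeQA Pro": 0.30}))  # 85.0
```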
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MMMU-Pro | 45% | Weighted | Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems |
| OfficeQA Pro | 30% | Weighted | Grounded office and enterprise document benchmark |
| MMMU-Pro w/ Python | — | Display only | Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning |
| OmniDocBench 1.5 | — | Display only | Document understanding benchmark measured by edit distance on complex document extraction tasks |
| GDPval-AA | — | Display only | Evaluation of professional domain expertise and task delivery quality in office-style knowledge work |
| MedXpertQA (MM) | — | Display only | Clinically grounded multimodal medical multiple-choice benchmark with image inputs |
| ZeroBench | — | Display only | Multi-step visual reasoning benchmark with pass@5 reporting and optional tool use |
| Design2Code | — | Display only | Multimodal coding benchmark for turning visual designs into working frontend implementations. |
| Flame-VLM-Code | — | Display only | Vision-language coding benchmark for generating correct code from visual and multimodal inputs. |
| Vision2Web | — | Display only | Benchmark for converting visual references into functional web implementations. |
| ImageMining | — | Display only | Multimodal retrieval and extraction benchmark over image-heavy task settings. |
| MMSearch | — | Display only | Multimodal search benchmark for retrieval and grounded answering across mixed-media inputs. |
| MMSearch-Plus | — | Display only | A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows. |
| SimpleVQA | — | Display only | Visual question answering benchmark focused on straightforward image-grounded understanding. |
| Facts-VLM | — | Display only | Grounded multimodal factuality benchmark for evidence-linked answer correctness. |
| V* | — | Display only | Vision-centric benchmark for high-level multimodal reasoning and perception quality. |
About Multimodal & Grounded Benchmarks
This category aggregates vision, document, and grounded enterprise-workflow benchmarks, headlined by MMMU-Pro, OfficeQA Pro, and CharXiv, to measure whether a model can read charts, screenshots, and documents as they actually appear in products.