
Multimodal & Grounded Benchmarks — MMMU-Pro & OfficeQA Leaderboard

Vision, document, and grounded enterprise workflow benchmarks

Bottom line: Multimodal is one of the fastest-evolving categories. Models that can read screenshots, charts, and documents are essential for enterprise copilots.

MMMU-Pro · OfficeQA Pro · MMMU-Pro w/ Python · OmniDocBench 1.5 · GDPval-AA · MedXpertQA (MM) · ZeroBench · Design2Code · Flame-VLM-Code · Vision2Web · ImageMining · MMSearch · MMSearch-Plus · SimpleVQA · Facts-VLM · V*

Vision · Document/Office · GUI/Web · Video

Best Multimodal & Grounded picks

BenchLM summaries for multimodal & grounded, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.

Top AI Models for Multimodal & Grounded (April 2026)

As of April 2026, GPT-5.4 Pro leads the provisional multimodal & grounded leaderboard with a weighted score of 100.0%, followed by Gemini 3 Pro Deep Think (100.0%) and Claude Mythos Preview (97.8%). BenchLM is currently showing 109 provisional-ranked models and 18 verified-ranked models in this category.

What changed

Claude Mythos Preview posts the strongest MMMU-Pro score in the category.

GPT-5.4 is a close second, with strong OfficeQA Pro performance.

Claude Opus 4.6 holds #3, consistent across all multimodal sub-tasks.

How to choose

Top models by benchmark

MMMU-Pro (55% of category score): frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems.

Multimodal & Grounded Leaderboard

Updated April 10, 2026

Sorted by multimodal & grounded weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
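As a rough sketch of what the two modes do, the snippet below filters a row list by verification status and re-sorts by weighted score. The field names, model names, and scores here are illustrative assumptions, not BenchLM's actual data model:

```python
# Illustrative sketch of provisional- vs verified-ranked modes.
# Row fields, model names, and scores are made up for illustration.
rows = [
    {"model": "model-a", "weighted_score": 95.8, "source_verified": True},
    {"model": "model-b", "weighted_score": 97.1, "source_verified": False},
    {"model": "model-c", "weighted_score": 86.3, "source_verified": True},
]

def rank(rows, mode="provisional"):
    # Verified-ranked mode keeps only rows with sourced evidence;
    # provisional-ranked mode includes the broader public dataset.
    visible = [r for r in rows if mode == "provisional" or r["source_verified"]]
    # Sort descending by the category's weighted score.
    return sorted(visible, key=lambda r: r["weighted_score"], reverse=True)

print([r["model"] for r in rank(rows, mode="verified")])     # ['model-a', 'model-c']
print([r["model"] for r in rank(rows, mode="provisional")])  # ['model-b', 'model-a', 'model-c']
```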

109 ranked models
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. (P = provisional benchmark row.)
[Leaderboard table omitted: only fragments of the top-25 rows survived extraction. Identifiable entries: #5 GPT-5.1 (OpenAI, 95.8%), #15 GPT-5.4 (OpenAI, 87.9%), #16 GPT-5.2 (OpenAI, 86.3%). Showing 25 of 109.]

These rankings update weekly


Score in Context

What these scores mean

Multimodal & Grounded carries a 12% weight in overall scoring. The weighted score blends MMMU-Pro (academic multimodal reasoning) and OfficeQA Pro (enterprise document understanding). A model can know facts in text and still fail when the information is in a chart, screenshot, or spreadsheet — this category measures that gap.
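Given the published 55/45 split, the category score reduces to a simple blend. A minimal sketch, assuming per-benchmark scores on a 0-100 scale:

```python
# Category weighted score under the published weights:
# MMMU-Pro 55%, OfficeQA Pro 45%. Scores assumed on a 0-100 scale.
def multimodal_category_score(mmmu_pro: float, officeqa_pro: float) -> float:
    return 0.55 * mmmu_pro + 0.45 * officeqa_pro

# Example: a model scoring 92.0 on MMMU-Pro and 88.0 on OfficeQA Pro
# gets 0.55 * 92.0 + 0.45 * 88.0 = 90.2 for the category.
print(round(multimodal_category_score(92.0, 88.0), 2))  # 90.2
```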

Known limitations

Not all models support image input — text-only models are excluded from this category entirely. OfficeQA Pro is relatively new and coverage is still building. Enterprise-specific document formats (scanned PDFs, handwritten notes) remain under-tested by all benchmarks.

How we weight

Multimodal & Grounded carries a 12% weight in BenchLM.ai's overall scoring. It remains important for enterprise copilots and document-heavy workflows where models need to interpret visuals, screenshots, and scanned artifacts.

This category tests whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, and mixed visual-text artifacts. See the multimodal leaderboard or compare with knowledge benchmarks.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
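A minimal sketch of that fallback, assuming the remaining weighted-benchmark weights renormalize over whichever trustworthy rows survive the filter. The flags and the renormalization rule here are assumptions; BenchLM's exact behavior lives on its methodology page:

```python
# Fallback sketch: drop untrusted (generated/cloned) rows, then renormalize
# the remaining weighted-benchmark weights so they still sum to 1.
# Field names and the renormalization rule are illustrative assumptions.
WEIGHTS = {"MMMU-Pro": 0.55, "OfficeQA Pro": 0.45}

def category_score(rows: dict) -> float | None:
    # rows maps benchmark name -> {"score": float, "trusted": bool}
    usable = {
        name: r["score"]
        for name, r in rows.items()
        if name in WEIGHTS and r["trusted"]
    }
    if not usable:
        return None  # nothing trustworthy left; no synthetic fill-in
    total = sum(WEIGHTS[name] for name in usable)
    return sum(WEIGHTS[name] / total * score for name, score in usable.items())

# If OfficeQA Pro is filtered out, the category falls back to MMMU-Pro alone.
print(category_score({"MMMU-Pro": {"score": 91.0, "trusted": True},
                      "OfficeQA Pro": {"score": 85.0, "trusted": False}}))  # 91.0
```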

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
MMMU-Pro | 55% | Weighted | Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems
OfficeQA Pro | 45% | Weighted | Grounded office and enterprise document benchmark
MMMU-Pro w/ Python | n/a | Display only | Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning
OmniDocBench 1.5 | n/a | Display only | Document understanding benchmark measured by edit distance on complex document extraction tasks
GDPval-AA | n/a | Display only | Evaluation of professional domain expertise and task delivery quality in office-style knowledge work
MedXpertQA (MM) | n/a | Display only | Clinically grounded multimodal medical multiple-choice benchmark with image inputs
ZeroBench | n/a | Display only | Multi-step visual reasoning benchmark with pass@5 reporting and optional tool use
Design2Code | n/a | Display only | Multimodal coding benchmark for turning visual designs into working frontend implementations
Flame-VLM-Code | n/a | Display only | Vision-language coding benchmark for generating correct code from visual and multimodal inputs
Vision2Web | n/a | Display only | Benchmark for converting visual references into functional web implementations
ImageMining | n/a | Display only | Multimodal retrieval and extraction benchmark over image-heavy task settings
MMSearch | n/a | Display only | Multimodal search benchmark for retrieval and grounded answering across mixed-media inputs
MMSearch-Plus | n/a | Display only | Harder MMSearch variant for multimodal retrieval and grounded tool-use workflows
SimpleVQA | n/a | Display only | Visual question answering benchmark focused on straightforward image-grounded understanding
Facts-VLM | n/a | Display only | Grounded multimodal factuality benchmark for evidence-linked answer correctness
V* | n/a | Display only | Vision-centric benchmark for high-level multimodal reasoning and perception quality

