
Multimodal & Grounded Benchmarks — MMMU-Pro, OfficeQA & CharXiv Leaderboard

Vision, document, and grounded enterprise workflow benchmarks

Bottom line: Multimodal is one of the fastest-evolving categories. Models that can read screenshots, charts, and documents are essential for enterprise copilots.

MMMU-Pro · OfficeQA Pro · MMMU-Pro w/ Python · OmniDocBench 1.5 · GDPval-AA · MedXpertQA (MM) · ZeroBench · Design2Code · Flame-VLM-Code · Vision2Web · ImageMining · MMSearch · MMSearch-Plus · SimpleVQA · Facts-VLM · V*

Vision · Document-office · GUI/web · Video

Best Multimodal & Grounded picks

BenchLM summaries for multimodal & grounded, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Multimodal & Grounded (April 2026)

As of April 2026, Gemini 3 Pro Deep Think leads the provisional multimodal & grounded leaderboard with a weighted score of 100.0%, followed by Grok 4.1 (97.8%) and Claude Mythos Preview (97.6%). BenchLM is currently showing 105 provisional-ranked models and 16 verified-ranked models in this category.

What changed

Claude Mythos Preview leads multimodal with the strongest MMMU-Pro score.

GPT-5.4 is close behind, with strong OfficeQA Pro and MMMU-Pro results.

Claude Opus 4.7 adds official CharXiv visual reasoning coverage.


Top models by benchmark

MMMU-Pro: frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems (45% of category score)

Multimodal & Grounded Leaderboard

Updated April 24, 2026

Sorted by multimodal & grounded weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

105 ranked models. Provisional-ranked mode includes source-unverified, non-generated benchmark evidence; "P" marks a provisional benchmark row.

Leaderboard (abridged; only rows with recoverable model names are shown):

| Rank | Model | Weighted score |
|------|-------|----------------|
| 1 | Gemini 3 Pro Deep Think | 100.0% |
| 2 | Grok 4.1 | 97.8% |
| 3 | Claude Mythos Preview | 97.6% |
| 4 | GPT-5.1 (OpenAI) | 96.3% |
| 15 | GPT-5.2 (OpenAI) | 79.8% |
| 23 | GLM-5 (Reasoning) (Z.AI, self-host) | 71.9% |

Showing 25 of 105.

These rankings update weekly


Score in Context

What these scores mean

Multimodal & Grounded carries a 12% weight in overall scoring. The weighted score blends MMMU-Pro (academic multimodal reasoning), OfficeQA Pro (enterprise document understanding), and CharXiv (visual chart and figure reasoning). A model can know facts in text and still fail when the information is in a chart, screenshot, or spreadsheet — this category measures that gap.
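As a rough sketch of how such a blend could work (the exact scores are illustrative; the MMMU-Pro and OfficeQA Pro weights come from the benchmark table below, while the CharXiv share is not published on this page and is assumed here to be the remainder):

```python
# Sketch of a weighted category score. Weights for MMMU-Pro (45%) and
# OfficeQA Pro (30%) are from the benchmark table; the CharXiv share (25%)
# is an assumption. Scores are hypothetical.
WEIGHTS = {"MMMU-Pro": 0.45, "OfficeQA Pro": 0.30, "CharXiv": 0.25}

def category_score(scores: dict[str, float]) -> float:
    """Blend available benchmark scores; renormalize when some are missing."""
    present = {bench: w for bench, w in WEIGHTS.items() if bench in scores}
    total_weight = sum(present.values())
    return sum(scores[bench] * w for bench, w in present.items()) / total_weight

# Model with all three benchmarks reported:
full = category_score({"MMMU-Pro": 92.0, "OfficeQA Pro": 88.0, "CharXiv": 80.0})
# Model missing CharXiv: the remaining weights are renormalized.
partial = category_score({"MMMU-Pro": 92.0, "OfficeQA Pro": 88.0})
```

Renormalizing over the benchmarks that are actually present keeps partially covered models comparable instead of silently penalizing a missing row as a zero.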

Known limitations

Not all models support image input — text-only models are excluded from this category entirely. OfficeQA Pro and CharXiv coverage is still building, so rankings should be read as a blend of available public evidence rather than a complete visual capability profile. Enterprise-specific document formats (scanned PDFs, handwritten notes) remain under-tested by all benchmarks.

How we weight

Multimodal & Grounded carries a 12% weight in BenchLM.ai's overall scoring. It remains important for enterprise copilots and document-heavy workflows where models need to interpret visuals, screenshots, and scanned artifacts.

This category tests whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, and mixed visual-text artifacts. See the multimodal leaderboard or compare with knowledge benchmarks.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
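That filtering rule can be sketched as follows (the row fields `generated` and `cloned_from` are hypothetical names; BenchLM's actual schema is not published here):

```python
# Illustrative filter: drop benchmark rows that were generated from other
# scores or cloned from a reference model, keeping only trustworthy
# public rows for the category. Field names are assumptions.
def trustworthy_rows(rows: list[dict]) -> list[dict]:
    return [
        r for r in rows
        if not r.get("generated") and not r.get("cloned_from")
    ]

rows = [
    {"benchmark": "MMMU-Pro", "score": 92.0},
    {"benchmark": "CharXiv", "score": 81.0, "generated": True},
    {"benchmark": "OfficeQA Pro", "score": 88.0, "cloned_from": "reference-model"},
]
kept = trustworthy_rows(rows)  # only the MMMU-Pro row survives the filter
```

The category score is then computed from whatever survives this filter, rather than backfilling the excluded benchmarks with synthetic values.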

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

| Benchmark | Weight | Status | Description |
|-----------|--------|--------|-------------|
| MMMU-Pro | 45% | Weighted | Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems |
| OfficeQA Pro | 30% | Weighted | Grounded office and enterprise document benchmark |
| MMMU-Pro w/ Python | | Display only | Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning |
| OmniDocBench 1.5 | | Display only | Document understanding benchmark measured by edit distance on complex document extraction tasks |
| GDPval-AA | | Display only | Evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work |
| MedXpertQA (MM) | | Display only | Clinically grounded multimodal medical multiple-choice benchmark with image inputs |
| ZeroBench | | Display only | Multi-step visual reasoning benchmark with pass@5 reporting and optional tool use |
| Design2Code | | Display only | Multimodal coding benchmark for turning visual designs into working frontend implementations |
| Flame-VLM-Code | | Display only | Vision-language coding benchmark for generating correct code from visual and multimodal inputs |
| Vision2Web | | Display only | Benchmark for converting visual references into functional web implementations |
| ImageMining | | Display only | Multimodal retrieval and extraction benchmark over image-heavy task settings |
| MMSearch | | Display only | Multimodal search benchmark for retrieval and grounded answering across mixed-media inputs |
| MMSearch-Plus | | Display only | A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows |
| SimpleVQA | | Display only | Visual question answering benchmark focused on straightforward image-grounded understanding |
| Facts-VLM | | Display only | Grounded multimodal factuality benchmark for evidence-linked answer correctness |
| V* | | Display only | Vision-centric benchmark for high-level multimodal reasoning and perception quality |

