Multimodal & Grounded Benchmarks — MMMU-Pro, OfficeQA & CharXiv Leaderboard
Vision, document, and grounded enterprise workflow benchmarks
Bottom line: Multimodal is one of the fastest-evolving categories. Models that can read screenshots, charts, and documents are essential for enterprise copilots.
MMMU-Pro · OfficeQA Pro · MMMU-Pro w/ Python · OmniDocBench 1.5 · GDPval-AA · MedXpertQA (MM) · ZeroBench · Design2Code · Flame-VLM-Code · Vision2Web · ImageMining · MMSearch · MMSearch-Plus · SimpleVQA · Facts-VLM · V*
Best Multimodal & Grounded picks
BenchLM summaries for multimodal & grounded, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
| Model | Vendor | Metric | Value |
|---|---|---|---|
| Gemini 3 Pro Deep Think | Google | Category score | 100 |
| DeepSeek V4 Pro (Max) | DeepSeek | Overall score | 87 |
| Qwen3.6-27B | Alibaba | Avg price / 1M tokens | $0.00 |
| Mercury 2 | Inception | Tokens / sec | 789 |
| LFM2-24B-A2B | LiquidAI | TTFT | 0.42s |
| Nemotron 3 Ultra 500B | NVIDIA | Context window | 10M |
Top AI Models for Multimodal & Grounded — April 2026
As of April 2026, Gemini 3 Pro Deep Think leads the provisional multimodal & grounded leaderboard with a weighted score of 100.0%, followed by Grok 4.1 (97.8%) and Claude Mythos Preview (97.6%). BenchLM is currently showing 105 provisional-ranked models and 16 verified-ranked models in this category.
1. Gemini 3 Pro Deep Think (Google)
2. Grok 4.1 (xAI)
3. Claude Mythos Preview (Anthropic)
Best multimodal reasoning. Top MMMU-Pro for academic and scientific visuals.
What changed
Claude Mythos Preview posts the strongest MMMU-Pro score in the category.
GPT-5.4 is close behind, with strong OfficeQA Pro and MMMU-Pro results.
Claude Opus 4.7 adds official CharXiv visual reasoning coverage.
How to choose
Top models by benchmark
MMMU-Pro: frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems (45% of category score)
Multimodal & Grounded Leaderboard
Updated April 24, 2026. Sorted by multimodal & grounded weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Vendor | Weighted score | Overall score | MMMU-Pro | OfficeQA Pro | MMMU-Pro w/ Python | OmniDocBench 1.5 | GDPval-AA | MedXpertQA (MM) | ZeroBench | Design2Code | Flame-VLM-Code | Vision2Web | ImageMining | MMSearch | MMSearch-Plus | SimpleVQA | Facts-VLM | V* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro Deep Think | Google | 100% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 2 | Grok 4.1 | xAI | 97.8% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 3 | Claude Mythos Preview | Anthropic | 97.6% | 99 | 92.7% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 4 | GPT-5.1 | OpenAI | 96.3% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 5 | GPT-5.3 Codex | OpenAI | 94.8% | Est. 88 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 6 | Claude Sonnet 4.5 | Anthropic | 94.8% | Est. 66 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 7 | GPT-5 (high) | OpenAI | 92.3% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 8 | GPT-5 (medium) | OpenAI | 89.6% | Est. 72 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 9 | GPT-5.1-Codex-Max | OpenAI | 89.2% | Est. 77 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 10 | — | — | 88.7% | Est. 70 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 11 | GPT-5.2-Codex | OpenAI | 88.2% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 12 | Gemini 2.5 Pro | Google | 84.3% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 13 | Claude Sonnet 4.6 | Anthropic | 83.8% | 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 14 | Gemini 3.1 Pro | Google | 81.6% | 92 | 83.9% | — | — | — | 1320 | 81.3% | 29.0% | — | — | — | — | — | — | 72.4% | — | — |
| 15 | GPT-5.2 | OpenAI | 79.8% | 81 | 79.5% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 75.9% |
| 16 | Gemini 3 Pro | Google | 79.2% | 81 | 81% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 88.0% |
| 17 | Muse Spark | Meta | 77.5% | 82 | 80.4% | — | — | — | 1444 | 78.4% | 33.0% | — | — | — | — | — | — | 71.3% | — | — |
| 18 | Claude 4.1 Opus | Anthropic | 76.5% | Est. 52 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 19 | Claude 4 Sonnet | Anthropic | 74.7% | Est. 51 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 20 | Claude Opus 4.6 | Anthropic | 73.6% | 87 | 77.3% | — | — | — | 1606 | 64.8% | — | — | — | — | — | — | — | — | — | — |
| 21 | Claude Haiku 4.5 | Anthropic | 72.8% | Est. 58 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 22 | Grok 4 | xAI | 72.2% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 23 | — | — | 71.9% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 24 | Gemini 3 Flash | Google | 69.8% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 25 | Qwen3.6 Plus | Alibaba | 68.9% | 74 | 78.8% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 96.9% |
Score in Context
What these scores mean
Multimodal & Grounded carries a 12% weight in overall scoring. The weighted score blends MMMU-Pro (academic multimodal reasoning), OfficeQA Pro (enterprise document understanding), and CharXiv (visual chart and figure reasoning). A model can know facts in text and still fail when the information is in a chart, screenshot, or spreadsheet — this category measures that gap.
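As a rough illustration of that blend, here is a minimal sketch in Python. It is not BenchLM's actual scoring code: the 45% and 30% weights below are the published MMMU-Pro and OfficeQA Pro weights, the CharXiv share is assumed to be the remaining 25%, and the model scores are hypothetical.

```python
# Sketch of the category blend described above; illustrative only.
WEIGHTS = {
    "MMMU-Pro": 0.45,      # published weight
    "OfficeQA Pro": 0.30,  # published weight
    "CharXiv": 0.25,       # assumption: the unpublished remainder of the blend
}

def blend(scores: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())

# Hypothetical scores, purely for illustration.
category = blend({"MMMU-Pro": 85.0, "OfficeQA Pro": 80.0, "CharXiv": 78.0})
print(round(category, 2))         # 81.75 -> category score
print(round(category * 0.12, 2))  # 9.81  -> points contributed to the overall score (12% weight)
```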
Known limitations
Not all models support image input — text-only models are excluded from this category entirely. OfficeQA Pro and CharXiv coverage is still building, so rankings should be read as a blend of available public evidence rather than a complete visual capability profile. Enterprise-specific document formats (scanned PDFs, handwritten notes) remain under-tested by all benchmarks.
How we weight
Multimodal & Grounded carries a 12% weight in BenchLM.ai's overall scoring. It remains important for enterprise copilots and document-heavy workflows where models need to interpret visuals, screenshots, and scanned artifacts.
This category tests whether a model can read the world as it actually appears in products: screenshots, charts, scanned documents, and mixed visual-text artifacts. See the multimodal leaderboard or compare with knowledge benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
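To make the filter-and-fallback behaviour concrete, here is a minimal sketch under the assumption that each benchmark row carries a provenance tag. The field names (`benchmark`, `score`, `source`), the `UNTRUSTED_SOURCES` values, and the example weights are illustrative, not BenchLM's actual schema.

```python
# Sketch of the trust filter and fallback described above; names are assumptions.
UNTRUSTED_SOURCES = {"generated", "cloned"}  # rows derived from other scores

def trusted_scores(rows: list[dict]) -> dict[str, float]:
    """Keep only benchmark rows backed by public evidence."""
    return {row["benchmark"]: row["score"]
            for row in rows if row["source"] not in UNTRUSTED_SOURCES}

def category_score(rows: list[dict], weights: dict[str, float]) -> float | None:
    """Blend trusted rows; when a weighted benchmark is missing, renormalize
    over the remaining trusted benchmarks instead of imputing a value."""
    scores = trusted_scores(rows)
    usable = {b: w for b, w in weights.items() if b in scores}
    if not usable:
        return None  # no trustworthy public rows, so no category score
    total = sum(usable.values())
    return sum(scores[b] * w for b, w in usable.items()) / total

# Example: the OfficeQA Pro row is excluded as cloned, so the category score
# falls back to the remaining trusted benchmark (MMMU-Pro) alone.
rows = [
    {"benchmark": "MMMU-Pro", "score": 85.0, "source": "public"},
    {"benchmark": "OfficeQA Pro", "score": 90.0, "source": "cloned"},
]
print(category_score(rows, {"MMMU-Pro": 0.45, "OfficeQA Pro": 0.30}))  # 85.0
```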
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MMMU-Pro | 45% | Weighted | Frontier multimodal reasoning benchmark spanning charts, diagrams, tables, and visual academic problems |
| OfficeQA Pro | 30% | Weighted | Grounded office and enterprise document benchmark |
| MMMU-Pro w/ Python | — | Display only | Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning |
| OmniDocBench 1.5 | — | Display only | Document understanding benchmark measured by edit distance on complex document extraction tasks |
| GDPval-AA | — | Display only | Evaluation of professional domain expertise and task delivery quality in office-style knowledge work |
| MedXpertQA (MM) | — | Display only | Clinically grounded multimodal medical multiple-choice benchmark with image inputs |
| ZeroBench | — | Display only | Multi-step visual reasoning benchmark with pass@5 reporting and optional tool use |
| Design2Code | — | Display only | Multimodal coding benchmark for turning visual designs into working frontend implementations. |
| Flame-VLM-Code | — | Display only | Vision-language coding benchmark for generating correct code from visual and multimodal inputs. |
| Vision2Web | — | Display only | Benchmark for converting visual references into functional web implementations. |
| ImageMining | — | Display only | Multimodal retrieval and extraction benchmark over image-heavy task settings. |
| MMSearch | — | Display only | Multimodal search benchmark for retrieval and grounded answering across mixed-media inputs. |
| MMSearch-Plus | — | Display only | A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows. |
| SimpleVQA | — | Display only | Visual question answering benchmark focused on straightforward image-grounded understanding. |
| Facts-VLM | — | Display only | Grounded multimodal factuality benchmark for evidence-linked answer correctness. |
| V* | — | Display only | Vision-centric benchmark for high-level multimodal reasoning and perception quality. |
About Multimodal & Grounded Benchmarks
This category aggregates vision, document, and grounded enterprise-workflow benchmarks, headlined by MMMU-Pro, OfficeQA Pro, and CharXiv, to measure whether a model can read charts, screenshots, and documents as they actually appear in products.