CharXiv Reasoning (CharXiv)

Name: CharXiv Reasoning
Creator: BenchLM

A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.

Top models on CharXiv — May 22, 2026

As of May 22, 2026, Claude Mythos Preview leads the CharXiv leaderboard with 93.2% , followed by Claude Opus 4.7 (Adaptive) (91%) and Muse Spark (86.4%).

1Closed

Claude Mythos Preview

Anthropic

93.2%

Overall 99Context 1M

2Closed

Claude Opus 4.7 (Adaptive)

Anthropic

91%

Overall 90Context 1M

3Closed

Muse Spark

About CharXiv

Year

2024

Tasks

Scientific chart reasoning

Format

Chart understanding and reasoning

Difficulty

Scientific visualization reasoning

CharXiv evaluates a model's ability to reason about real-world scientific charts rather than simple visual QA. With-tools and without-tools variants isolate raw visual reasoning from tool-augmented performance.

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

BenchLM freshness & provenance

Version

CharXiv 2024

Refresh cadence

Annual

Staleness state

Refreshing

Question availability

Public benchmark set

Refreshing

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (21 models)

Claude Mythos Preview

AnthropicClosed

93.2%

Claude Opus 4.7 (Adaptive)

AnthropicClosed

91%

Muse Spark

MetaClosed

86.4%

Gemini 3.5 Flash

GoogleClosed

84.2%

GPT-5.4

OpenAIClosed

82.8%

GPT-5.2

OpenAIClosed

82.1%

Qwen3.6 Plus

AlibabaClosed

81.5%

Gemini 3 Pro

GoogleClosed

81.4%

MiMo-V2.5

XiaomiClosed

81%

Qwen3.5 397B

AlibabaOpen

80.8%

Kimi K2.6

Moonshot AIOpen

80.4%

Gemini 3.1 Pro

GoogleClosed

80.2%

Qwen3.6-27B

AlibabaOpen

78.4%

Qwen3.6-35B-A3B

AlibabaOpen

78%

Claude Sonnet 4.6

AnthropicClosed

77.4%

Qwen3.5-122B-A10B

AlibabaOpen

77.2%

Nemotron 3 Nano Omni 30B A3B

NVIDIAOpen

76.3%

Gemini 3.1 Flash-Lite

GoogleClosed

73.2%

Claude Opus 4.5

AnthropicClosed

68.5%

Grok 4.20

xAIClosed

60.9%

Command A+

CohereOpen

52.7%

FAQ

What does CharXiv measure?

A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.

Which model scores highest on CharXiv?

Claude Mythos Preview by Anthropic currently leads with a score of 93.2% on CharXiv.

How many models are evaluated on CharXiv?

21 AI models have been evaluated on CharXiv on BenchLM.

Compare Top Models on CharXiv

Claude Mythos Preview vs Claude Opus 4.7 (Adaptive)Claude Opus 4.7 (Adaptive) vs Muse Spark Muse Spark vs Gemini 3.5 Flash Gemini 3.5 Flash vs GPT-5.4

Last updated: May 22, 2026 · BenchLM version CharXiv 2024

The AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.