CharXiv Reasoning without tools (CharXiv w/o tools)

Name: CharXiv Reasoning without tools
Creator: BenchLM

Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.

Benchmark score on CharXiv w/o tools — July 6, 2026

BenchLM mirrors the published score view for CharXiv w/o tools. Claude Mythos 5 leads the public snapshot at 88.9%, followed by Claude Fable 5 (86.1%) and Claude Opus 4.7 (Adaptive) (82.1%). BenchLM does not use these results to rank models overall.

1. Claude Mythos 5 (Anthropic, Closed): 88.9%. Overall 90, Context 1M+
2. Claude Fable 5 (Anthropic, Closed): 86.1%. Overall 92, Context 1M+
3. Claude Opus 4.7 (Adaptive) (Anthropic, Closed): 82.1%. Overall 77, Context 1M

5 models | Multimodal & Grounded | Refreshing | Display only | Updated July 6, 2026

The published CharXiv w/o tools snapshot is tightly clustered at the top: Claude Mythos 5 sits at 88.9%, while the third row is only 6.8 points behind. The broader top-10 spread is 11.9 points, so the benchmark still separates strong models even when the leaders cluster.
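The gap figures above follow directly from the published scores. As a minimal sketch, the arithmetic can be reproduced as follows; the `gap_to_leader` helper is a hypothetical name used only for illustration, and only the three rows shown in the snapshot are included.

```python
# Scores (percent) copied from the three rows shown in the snapshot above.
scores = {
    "Claude Mythos 5": 88.9,
    "Claude Fable 5": 86.1,
    "Claude Opus 4.7 (Adaptive)": 82.1,
}


def gap_to_leader(scores, rank):
    """Return the point gap between the leader and the model at a 1-based rank.

    Hypothetical helper for illustration; not part of any BenchLM tooling.
    """
    ordered = sorted(scores.values(), reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ordered[0] - ordered[rank - 1], 1)


print(gap_to_leader(scores, 2))  # gap between first and second row
print(gap_to_leader(scores, 3))  # gap between first and third row
```

The third-row gap computed this way matches the 6.8 points quoted in the text. The top-10 spread cannot be checked the same way because the remaining rows are not listed here.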

Five models have been evaluated on CharXiv w/o tools. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM.ai's overall scoring system. CharXiv w/o tools is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.

About CharXiv w/o tools

Year: 2024
Tasks: Scientific chart reasoning (tool-free)
Format: Chart understanding without tools
Difficulty: Scientific visualization reasoning

The tool-free CharXiv variant measures multimodal reasoning without tool assistance. Mythos Preview scores 86.1% without tools versus 93.2% with tools, a 7.1-point gap that indicates strong baseline chart reasoning.
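To make the with-tools versus without-tools comparison concrete, the sketch below computes the uplift from the two Mythos Preview scores quoted above. The `tool_uplift` function is a hypothetical helper, and reporting a relative uplift alongside the absolute one is an assumption of this example rather than something BenchLM publishes.

```python
def tool_uplift(without_tools, with_tools):
    """Return (absolute, relative) uplift from tool use, in points and percent.

    Hypothetical helper for illustration; inputs are benchmark scores in percent.
    """
    absolute = round(with_tools - without_tools, 1)
    # Relative uplift is measured against the tool-free baseline.
    relative = round(100 * (with_tools - without_tools) / without_tools, 1)
    return absolute, relative


# Scores quoted in the text for Mythos Preview.
absolute, relative = tool_uplift(86.1, 93.2)
print(f"absolute uplift: {absolute} points, relative uplift: {relative}%")
```

A small absolute uplift relative to a high baseline is what the text means by strong baseline chart reasoning: most of the with-tools score is already reached without tools.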

Source paper: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

BenchLM freshness & provenance

Version: CharXiv w/o tools 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.