A diagram understanding benchmark focused on scientific and educational visual question answering.
BenchLM mirrors the published score view for AI2D_TEST. Qwen3.6-35B-A3B leads the public snapshot at 92.7%, followed by Nemotron 3 Nano Omni 30B A3B at 88.5%. BenchLM does not use these results to rank models overall.
Qwen3.6-35B-A3B (Alibaba)
Nemotron 3 Nano Omni 30B A3B (NVIDIA)
Year: 2026
Tasks: Diagram understanding
Format: Diagram-grounded QA
Difficulty: Structured visual reasoning
AI2D-style tasks matter because diagrams compress structure differently from photos or office documents. They test whether a model can parse arrows, labels, and spatial relations in technical illustrations.
Version: AI2D_TEST 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
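As a rough illustration of how freshness metadata could map to those three treatments, here is a minimal Python sketch. It is not BenchLM's actual policy: the `Freshness` fields, the `last_refreshed` date, the 91-day reading of "quarterly", and the age thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date

# Tier names follow the three treatments described above.
STRONG = "strong differentiator"
WATCH = "benchmark to watch"
DISPLAY_ONLY = "display-only reference"


@dataclass
class Freshness:
    """Hypothetical freshness record; field names are assumptions."""
    version: str
    refresh_days: int     # assumed: a quarterly cadence is about 91 days
    last_refreshed: date  # hypothetical field, not shown on this page


def classify(meta: Freshness, today: date) -> str:
    """Illustrative rule only; the thresholds are assumed, not BenchLM policy."""
    age_days = (today - meta.last_refreshed).days
    if age_days <= meta.refresh_days:
        return STRONG          # refreshed within one cadence window
    if age_days <= 2 * meta.refresh_days:
        return WATCH           # one missed refresh
    return DISPLAY_ONLY        # stale beyond two cadence windows


# Example record using the version and cadence listed above; the date is made up.
ai2d = Freshness(version="AI2D_TEST 2026", refresh_days=91,
                 last_refreshed=date(2026, 1, 1))
print(classify(ai2d, date(2026, 2, 1)))
```

The real scoring policy may weigh other signals, such as question availability or contamination risk; the methodology page is the authoritative source.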
Qwen3.6-35B-A3B by Alibaba currently leads with a score of 92.7% on AI2D_TEST.
Two AI models have been evaluated on AI2D_TEST on BenchLM.