A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — average individual human performance is 66%.
As of May 1, 2026, GPT-5.5 leads the ARC-AGI-2 leaderboard with 85%, followed by GPT-5.4 Pro (83.3%) and Gemini 3.1 Pro (77.1%).
According to BenchLM.ai, GPT-5.5 leads the ARC-AGI-2 benchmark with a score of 85%, followed by GPT-5.4 Pro (83.3%) and Gemini 3.1 Pro (77.1%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.
10 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
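To make the weighting concrete, here is a minimal sketch of how a single benchmark score could feed into an overall ranking, assuming the two weights simply multiply (an assumption; BenchLM's exact aggregation is described on its methodology page):

```python
# Assumed weights, taken from the figures quoted above.
CATEGORY_WEIGHT = 0.17   # Reasoning category's weight in the overall score
BENCHMARK_WEIGHT = 0.25  # ARC-AGI-2's weight within the Reasoning category

def overall_contribution(benchmark_score: float) -> float:
    """Points a single benchmark score contributes to the overall score,
    under the simple multiplicative-weighting assumption."""
    return benchmark_score * BENCHMARK_WEIGHT * CATEGORY_WEIGHT

# GPT-5.5's 85% on ARC-AGI-2 alone would account for roughly
# 85 * 0.25 * 0.17 = 3.6 points of its overall score.
print(round(overall_contribution(85.0), 2))
```

Under this reading, ARC-AGI-2 effectively carries about 4.25% (0.25 × 0.17) of a model's total score, which is why movement on this one benchmark is visible in the overall ranking.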
Year: 2025
Tasks: Visual pattern completion and abstract reasoning
Format: Grid transformation puzzles with novel rules
Difficulty: Expert-level — hardest public reasoning benchmark
ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. Average individual human performance is 66%, the human panel completion rate is 100%, and the grand prize threshold is greater than 85%. Top frontier models reach 75-85% in BenchLM's tracked data, making it one of the few benchmarks that still separates current reasoning systems.
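The task format described above can be illustrated with a toy example (this puzzle is invented for illustration and is not drawn from the benchmark): a task supplies input/output grid pairs, and a solver must infer the transformation rule and apply it to an unseen input. Here the hidden rule is a simple horizontal mirror; real ARC-AGI-2 rules are far harder and deliberately novel.

```python
# Grids are small matrices of integer color codes, as in ARC.
Grid = list[list[int]]

# Training pairs demonstrating a hidden rule (here: mirror each row).
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]

def mirror(grid: Grid) -> Grid:
    """Candidate rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A candidate rule is accepted only if it reproduces every training output.
assert all(mirror(inp) == out for inp, out in train_pairs)

# Apply the inferred rule to an unseen test input.
print(mirror([[9, 8], [7, 6]]))  # [[8, 9], [6, 7]]
```

Scoring against the held-out test output is exact-match on the full grid, which is part of why these tasks remain hard for models that rely on approximate pattern completion.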
Version: ARC-AGI-2
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
GPT-5.5 by OpenAI currently leads with a score of 85% on ARC-AGI-2.
10 AI models have been evaluated on ARC-AGI-2 on BenchLM.