A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.
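To make the task format concrete, here is a toy sketch in Python. It is not an actual ARC-AGI-2 puzzle: the grids, the hidden rule, and the mirror function are all invented for illustration. Real tasks hide a novel rule that the solver must discover from the training pairs alone.

# Toy illustration of the ARC task format (NOT an actual ARC-AGI-2 puzzle).
# Each task gives a few input -> output grid pairs; the solver must infer
# the hidden rule and apply it to a new input. Here the hidden rule is a
# simple horizontal mirror, chosen only for demonstration.

def mirror(grid):
    """Hypothetical rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
]

test_input = [[3, 0, 1],
              [0, 4, 0]]

# A real solver must discover the rule; here we just verify that our
# guessed rule fits the training pairs, then apply it to the unseen input.
assert all(mirror(i) == o for i, o in train_pairs)
print(mirror(test_input))  # [[1, 0, 3], [0, 4, 0]]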
As of April 10, 2026, GPT-5.4 Pro leads the ARC-AGI-2 leaderboard with 83.3%, followed by Gemini 3.1 Pro (77.1%) and Grok 4.20 (53.3%).
ARC-AGI-2 leaderboard (top three):
1. GPT-5.4 Pro (OpenAI): 83.3%
2. Gemini 3.1 Pro (Google): 77.1%
3. Grok 4.20 (xAI): 53.3%
According to BenchLM.ai, GPT-5.4 Pro leads the ARC-AGI-2 benchmark with a score of 83.3%, followed by Gemini 3.1 Pro (77.1%) and Grok 4.20 (53.3%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.
8 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
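For a sense of scale, if the category weight and the within-category weight simply multiply (an assumption; BenchLM's exact formula lives on its methodology page), ARC-AGI-2's effective share of the overall score works out as follows:

# Hypothetical effective weight of ARC-AGI-2 in the overall score,
# assuming category and within-category weights combine multiplicatively
# (not confirmed by BenchLM's published methodology).
category_weight = 0.17   # Reasoning category's share of the overall score
benchmark_weight = 0.25  # ARC-AGI-2's share of the Reasoning category

effective_weight = category_weight * benchmark_weight
print(f"{effective_weight:.4f}")  # 0.0425 -> about 4.25% of the overall score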
Year: 2025
Tasks: Visual pattern completion and abstract reasoning
Format: Grid transformation puzzles with novel rules
Difficulty: Expert-level — hardest public reasoning benchmark
ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. The human baseline is 60% and the grand prize threshold is 85%. Scores on this leaderboard range from 53.3% to 83.3%, so the best models now exceed the human average but none has reached the prize threshold, making ARC-AGI-2 one of the few public reasoning benchmarks that frontier models have not yet saturated.
Version: ARC-AGI-2
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
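As a minimal sketch, such a policy could be keyed on the staleness state shown above. The tier names other than "Current" and the mapping itself are hypothetical, not BenchLM's actual rules:

# Hypothetical mapping from freshness metadata to how a benchmark is
# treated. "Current" comes from the metadata above; the other states and
# the mapping are invented for illustration.
def benchmark_role(staleness_state: str) -> str:
    roles = {
        "Current": "strong differentiator",
        "Aging": "benchmark to watch",
        "Stale": "display-only reference",
    }
    return roles.get(staleness_state, "display-only reference")

print(benchmark_role("Current"))  # strong differentiator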
GPT-5.4 Pro by OpenAI currently leads ARC-AGI-2 with a score of 83.3%, and 8 AI models have been evaluated on the benchmark on BenchLM.