A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.
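To make the task format concrete, here is a toy task in ARC's JSON-style grid representation, where grids are 2D arrays of integers 0-9 (colors). The task and its mirror rule are invented for illustration; real ARC-AGI-2 rules are far harder.

```python
# A minimal ARC-style task: a few train input/output pairs plus a test input.
# This toy task's hidden rule is a horizontal mirror (invented example, not
# drawn from the actual benchmark).

task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
    ],
    "test": [
        {"input": [[7, 0, 4]]},  # solver must produce [[4, 0, 7]]
    ],
}

def solve(grid):
    """Candidate rule induced from the train pairs: mirror each row."""
    return [row[::-1] for row in grid]

# Verify the candidate rule reproduces every training output exactly --
# ARC scoring applies the same all-or-nothing check to test outputs.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))  # → [[4, 0, 7]]
```

The all-or-nothing check matters: a single wrong cell in the generated grid counts as a failure, which is part of why scores stay well below saturation.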
As of March 2026, Gemini 3.1 Pro leads the ARC-AGI-2 leaderboard with 77.1%, followed by GPT-5.4 (73.3%) and Claude Opus 4.6 (68.8%).
According to BenchLM.ai, there is significant spread across the leaderboard, making ARC-AGI-2 effective at differentiating model capabilities.
17 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
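The nested weights above can be sketched as a short calculation, assuming BenchLM.ai combines scores as simple weighted averages (an assumption; the site's exact formula is not stated here).

```python
# Sketch of how an ARC-AGI-2 score could feed into an overall ranking,
# assuming nested weighted averages (the site's exact formula is not
# documented here).

REASONING_CATEGORY_WEIGHT = 0.17  # Reasoning category's share of the overall score
ARC_AGI_2_WEIGHT = 0.25           # ARC-AGI-2's share within the Reasoning category

def overall_contribution(arc_score: float) -> float:
    """Points a model's ARC-AGI-2 score adds to its overall score (0-100 scale)."""
    return arc_score * ARC_AGI_2_WEIGHT * REASONING_CATEGORY_WEIGHT

# Under these assumed weights, ARC-AGI-2 accounts for 0.25 * 0.17 = 4.25%
# of the overall score, so the leader's 77.1% contributes about 3.28 points.
print(round(overall_contribution(77.1), 2))  # → 3.28
```

Under these assumptions, the 8.3-point spread between the top and third-place models translates to roughly 0.35 points of overall-score separation from this benchmark alone.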
Year: 2025
Tasks: Visual pattern completion and abstract reasoning
Format: Grid transformation puzzles with novel rules
Difficulty: Expert-level (hardest public reasoning benchmark)
ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report model performance on it. The human baseline is 60% and the grand prize threshold is 85%; top frontier models score 68-77%, making ARC-AGI-2 one of the few public reasoning benchmarks that frontier models have not yet saturated.
ARC-AGI-2: A Harder General Intelligence Benchmark