Abstraction and Reasoning Corpus for AGI v2 (ARC-AGI-2)

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.
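
For readers unfamiliar with the format: ARC-style tasks are distributed as JSON, each containing a few demonstration input-output grid pairs plus one or more test inputs, where a grid is a small 2-D array of integers 0-9 encoding colors. Below is a minimal sketch of loading a task and scoring a prediction, assuming the same JSON schema as the public ARC-AGI repositories (the file name is hypothetical):

```python
import json

def load_task(path: str) -> dict:
    """Load one ARC-style task: {"train": [...], "test": [...]},
    where each entry is {"input": grid, "output": grid} and a grid
    is a list of rows of ints 0-9 (colors)."""
    with open(path) as f:
        return json.load(f)

def exact_match(predicted: list[list[int]], expected: list[list[int]]) -> bool:
    # Scoring is all-or-nothing: the predicted grid must match the
    # expected grid cell for cell, including its dimensions.
    return predicted == expected

task = load_task("arc_task.json")  # hypothetical file name
for pair in task["train"]:
    print(f'{len(pair["input"])}x{len(pair["input"][0])} in -> '
          f'{len(pair["output"])}x{len(pair["output"][0])} out')
```

The all-or-nothing grading is part of what makes the benchmark hard: a model gets no partial credit for a nearly-correct grid.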

Top Models on ARC-AGI-2 — March 2026

As of March 2026, Gemini 3.1 Pro leads the ARC-AGI-2 leaderboard with 77.1%, followed by GPT-5.4 (73.3%) and Claude Opus 4.6 (68.8%).


There is a wide spread across the leaderboard, from 1.3% at the bottom to 77.1% at the top, which makes ARC-AGI-2 unusually effective at differentiating model capabilities.
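
To put numbers on that spread, here is a quick sketch over the scores published in the leaderboard below (values are copied from this page; the statistics are purely illustrative):

```python
from statistics import mean, pstdev

# The 15 scores (%) listed on this page's leaderboard
# (two ranked entries are not shown on the page).
scores = [77.1, 73.3, 68.8, 59.0, 52.9, 37.6, 31.1,
          17.6, 16.0, 13.6, 4.9, 4.0, 3.0, 2.4, 1.3]

print(f"range: {max(scores) - min(scores):.1f} pts")  # 75.8 pts
print(f"mean:  {mean(scores):.1f}%")
print(f"stdev: {pstdev(scores):.1f} pts")
```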

17 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
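
In concrete terms, if BenchLM.ai blends scores multiplicatively (an assumption; the page only states the two weights), ARC-AGI-2's effective share of the overall score works out as follows:

```python
CATEGORY_WEIGHT = 0.17  # Reasoning category's share of the overall score
WITHIN_CATEGORY = 0.25  # ARC-AGI-2's share of the Reasoning category

# Assumed simple multiplicative weighting; not confirmed by the page.
effective = CATEGORY_WEIGHT * WITHIN_CATEGORY
print(f"effective weight: {effective:.2%}")          # 4.25%
print(f"+10 pts on ARC-AGI-2 -> +{10 * effective:.3f} overall pts")
```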

About ARC-AGI-2

Year: 2025
Tasks: Visual pattern completion and abstract reasoning
Format: Grid transformation puzzles with novel rules
Difficulty: Expert-level — hardest public reasoning benchmark

ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. The human baseline is 60% and the grand prize threshold is 85%. Top frontier models reach 68-77%, making it one of the few benchmarks where AI has not yet saturated human performance.
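
For a sense of the remaining headroom, the gap between each frontier score and the two reference points can be computed directly (scores and thresholds as stated above; the comparison assumes they are on the same percentage scale):

```python
HUMAN_BASELINE = 60.0  # human average, per this page
GRAND_PRIZE = 85.0     # ARC Prize grand prize threshold, per this page

top_models = {"Gemini 3.1 Pro": 77.1, "GPT-5.4": 73.3, "Claude Opus 4.6": 68.8}

for name, score in top_models.items():
    print(f"{name}: {score - HUMAN_BASELINE:+.1f} pts vs humans, "
          f"{score - GRAND_PRIZE:+.1f} pts vs grand prize")
```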

ARC-AGI-2: A Harder General Intelligence Benchmark

Leaderboard (17 models)

#1   Gemini 3.1 Pro      77.1%
#2   GPT-5.4             73.3%
#3   Claude Opus 4.6     68.8%
#4   Claude Sonnet 4.6   59.0%
#5   GPT-5.2             52.9%
#7   Claude Opus 4.5     37.6%
#8   Gemini 3 Pro        31.1%
#9   GPT-5.1             17.6%
#10  Grok 4              16.0%
#11  Claude Sonnet 4.5   13.6%
#12  Gemini 2.5 Pro       4.9%
#14  DeepSeek V3.2        4.0%
#15  o3                   3.0%
#16  o4-mini (high)       2.4%
#17  DeepSeek-R1          1.3%

FAQ

What does ARC-AGI-2 measure?

ARC-AGI-2 measures fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify the pattern in a handful of input-output examples and generate the correct output for unseen inputs. It is considered the hardest public reasoning benchmark; the human average is 60%.

Which model scores highest on ARC-AGI-2?

Gemini 3.1 Pro by Google currently leads with a score of 77.1% on ARC-AGI-2.

How many models are evaluated on ARC-AGI-2?

17 AI models have been evaluated on ARC-AGI-2 on BenchLM.

Last updated: March 18, 2026
