
Abstraction and Reasoning Corpus for AGI v2 (ARC-AGI-2)

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.

Top models on ARC-AGI-2 — April 10, 2026

As of April 10, 2026, GPT-5.4 Pro leads the ARC-AGI-2 leaderboard with 83.3%, followed by Gemini 3.1 Pro (77.1%) and Grok 4.20 (53.3%).

8 models · Reasoning · 25% of category score · Current · Updated April 10, 2026

According to BenchLM.ai, GPT-5.4 Pro leads the ARC-AGI-2 benchmark with a score of 83.3%, followed by Gemini 3.1 Pro (77.1%) and Grok 4.20 (53.3%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.

8 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
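As a rough sketch of how that weighting compounds (the exact BenchLM formula is not given on this page, so the multiplicative form below is an assumption): a benchmark's contribution to a model's overall score would be its benchmark score times its share of the category times the category's overall weight.

```python
# Assumed weighting sketch, not BenchLM's published implementation:
# contribution = benchmark_score * category_share * category_weight.

def overall_contribution(benchmark_score: float,
                         category_share: float = 0.25,
                         category_weight: float = 0.17) -> float:
    """Points a single benchmark adds to a model's overall score
    under the assumed multiplicative weighting."""
    return benchmark_score * category_share * category_weight

# GPT-5.4 Pro's 83.3% on ARC-AGI-2 under these assumed weights:
print(round(overall_contribution(83.3), 2))  # -> 3.54
```

Under this assumption, ARC-AGI-2 alone can move a model's overall score by up to about 4.25 points (a perfect 100% × 0.25 × 0.17), which is why the page calls it a strong differentiator.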

About ARC-AGI-2

Year: 2025

Tasks: Visual pattern completion and abstract reasoning

Format: Grid transformation puzzles with novel rules

Difficulty: Expert-level — hardest public reasoning benchmark

ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. The human baseline is 60% and the grand prize threshold is 85%. The top frontier models score between 77% and 83%, above the human baseline but still short of the grand-prize threshold, making this one of the few public reasoning benchmarks that frontier models have not yet saturated.
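For readers unfamiliar with the task format, here is a minimal, hypothetical illustration of an ARC-style puzzle. The specific transformation rule below is invented for the example and is not taken from the benchmark; real ARC-AGI-2 tasks use novel rules that must be inferred from a handful of demonstration pairs.

```python
# Hypothetical ARC-style task: grids are lists of lists of color
# codes (0-9). In this invented example the hidden rule is "mirror
# the grid horizontally"; a solver must infer that from the
# demonstration pair, then apply it to an unseen test input.

def mirror_horizontal(grid):
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

train_pair = {
    "input":  [[1, 0, 0],
               [0, 2, 0]],
    "output": [[0, 0, 1],
               [0, 2, 0]],
}

# Verify the candidate rule reproduces the demonstration pair,
# then apply it to a new test input.
assert mirror_horizontal(train_pair["input"]) == train_pair["output"]
print(mirror_horizontal([[3, 0], [0, 4]]))  # -> [[0, 3], [4, 0]]
```

The difficulty of the real benchmark lies in the rule-inference step: each task uses a rule the model has never seen, so memorized patterns do not transfer.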

BenchLM freshness & provenance

Version: ARC-AGI 2

Refresh cadence: Static

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (8 models)

1. GPT-5.4 Pro: 83.3%
2. Gemini 3.1 Pro: 77.1%
3. Grok 4.20: 53.3%
4. 52.9%
5. 45.1%
6. 42.5%
7. 31.1%
8. 13.6%
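The "significant spread" claimed earlier can be quantified directly from these eight scores:

```python
# Quantify the spread across the ARC-AGI-2 leaderboard scores above.
scores = [83.3, 77.1, 53.3, 52.9, 45.1, 42.5, 31.1, 13.6]

spread = max(scores) - min(scores)  # range, in percentage points
mean = sum(scores) / len(scores)

print(f"range: {spread:.1f} pts, mean: {mean:.1f}%")
# -> range: 69.7 pts, mean: 49.9%
```

A roughly 70-point gap between the best and worst model is what makes this benchmark useful for differentiating capabilities, in contrast to saturated benchmarks where most models cluster near the ceiling.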

FAQ

What does ARC-AGI-2 measure?

ARC-AGI-2 measures fluid intelligence and novel abstract reasoning through visual grid puzzles: models must identify the pattern in a set of input-output pairs and generate the correct output for an unseen input. It is considered the hardest public reasoning benchmark; the human average is 60%.

Which model scores highest on ARC-AGI-2?

GPT-5.4 Pro by OpenAI currently leads with a score of 83.3% on ARC-AGI-2.

How many models are evaluated on ARC-AGI-2?

8 AI models have been evaluated on ARC-AGI-2 on BenchLM.

Last updated: April 10, 2026 · BenchLM version ARC-AGI 2
