
Abstraction and Reasoning Corpus for AGI v2 (ARC-AGI-2)

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — average individual human performance is 66%.
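ARC tasks are typically distributed as small JSON files containing a handful of demonstration input-output pairs plus one or more test inputs, with each grid encoded as a list of lists of integers 0-9. The sketch below assumes that standard ARC task format (the file path and the example rule are illustrative, not part of the benchmark) and shows how a candidate transformation can be checked against the demonstration pairs before being applied to the unseen test input.

```python
import json

def load_task(path):
    """Load an ARC task: a dict with 'train' and 'test' lists of
    {'input': grid, 'output': grid} pairs, where each grid is a
    list of lists of integers 0-9 (colors)."""
    with open(path) as f:
        return json.load(f)

def solves_train_pairs(task, transform):
    """Return True if `transform` maps every demonstration input
    to its demonstration output."""
    return all(transform(pair["input"]) == pair["output"]
               for pair in task["train"])

# A candidate rule for illustration only: flip each grid left-to-right.
def flip_horizontal(grid):
    return [list(reversed(row)) for row in grid]

# Usage (path is illustrative):
# task = load_task("example_task.json")
# if solves_train_pairs(task, flip_horizontal):
#     predictions = [flip_horizontal(t["input"]) for t in task["test"]]
```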

Top models on ARC-AGI-2 — May 1, 2026

As of May 1, 2026, GPT-5.5 leads the ARC-AGI-2 leaderboard with 85%, followed by GPT-5.4 Pro (83.3%) and Gemini 3.1 Pro (77.1%).

10 models · Reasoning · 25% of category score · Current · Updated May 1, 2026

There is significant spread across the leaderboard, from 85% at the top to 13.6% at the bottom, making this benchmark effective at differentiating model capabilities; the spread is quantified after the leaderboard below.

10 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
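Putting those weights together, 25% of a 17% category works out to an effective weight of about 4.25% of the overall score. The short sketch below shows the arithmetic, assuming BenchLM combines scores by simple weighted averaging (the aggregation method is an assumption, not documented on this page).

```python
# ARC-AGI-2's share of the Reasoning category and the category's
# share of the overall BenchLM score (both stated on this page).
BENCHMARK_SHARE_OF_CATEGORY = 0.25
CATEGORY_SHARE_OF_OVERALL = 0.17

# Effective weight of ARC-AGI-2 in the overall score.
effective_weight = BENCHMARK_SHARE_OF_CATEGORY * CATEGORY_SHARE_OF_OVERALL
print(f"Effective overall weight: {effective_weight:.4f}")  # 0.0425

# Illustrative example: contribution of an 85% ARC-AGI-2 score to the
# overall score, assuming simple weighted averaging of benchmark scores.
score = 0.85
print(f"Overall-score contribution: {score * effective_weight:.4f}")  # ~0.0361
```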

About ARC-AGI-2

Year: 2025

Tasks: Visual pattern completion and abstract reasoning

Format: Grid transformation puzzles with novel rules

Difficulty: Expert-level (hardest public reasoning benchmark)

ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. Average individual human performance is 66%, the human panel completion rate is 100%, and the grand prize threshold is greater than 85%. Top frontier models reach 75-85% in BenchLM's tracked data, making it one of the few benchmarks that still separates current reasoning systems.

BenchLM freshness & provenance

Version: ARC-AGI 2

Refresh cadence: Static

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
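The exact rules are not spelled out on this page, so the sketch below is purely hypothetical: the field names, tier labels, and ordering of checks are assumptions meant only to illustrate how freshness metadata could map to one of those three tiers, not BenchLM's actual policy.

```python
# Hypothetical illustration only: map freshness metadata to a display tier.
# Field values, tier labels, and rules are assumptions, not BenchLM's policy.
def benchmark_tier(staleness_state: str, refresh_cadence: str) -> str:
    if staleness_state == "Current":
        return "strong differentiator"
    if refresh_cadence == "Static":
        return "benchmark to watch"
    return "display-only reference"

# For ARC-AGI-2 as listed above: staleness "Current", cadence "Static".
print(benchmark_tier("Current", "Static"))  # strong differentiator
```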

Leaderboard (10 models)

1. GPT-5.5: 85%
2. GPT-5.4 Pro: 83.3%
3. Gemini 3.1 Pro: 77.1%
4. 75.8%
5. 53.3%
6. 52.9%
7. 45.1%
8. 42.5%
9. 31.1%
10. 13.6%
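To make the "significant spread" claim above concrete, the sketch below computes the range and standard deviation of the ten scores listed in the leaderboard.

```python
from statistics import pstdev

# The ten ARC-AGI-2 scores from the leaderboard above, in percent.
scores = [85.0, 83.3, 77.1, 75.8, 53.3, 52.9, 45.1, 42.5, 31.1, 13.6]

print(f"Range: {max(scores) - min(scores):.1f} points")       # 71.4
print(f"Std dev: {pstdev(scores):.1f} points")
print(f"Top-to-second gap: {scores[0] - scores[1]:.1f} points")  # 1.7
```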

FAQ

What does ARC-AGI-2 measure?

ARC-AGI-2 measures fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. It is considered the hardest public reasoning benchmark; average individual human performance is 66%.

Which model scores highest on ARC-AGI-2?

GPT-5.5 by OpenAI currently leads with a score of 85% on ARC-AGI-2.

How many models are evaluated on ARC-AGI-2?

10 AI models have been evaluated on ARC-AGI-2 on BenchLM.

Last updated: May 1, 2026 · BenchLM version ARC-AGI 2
