Abstraction and Reasoning Corpus for AGI v2 (ARC-AGI-2)

Name: Abstraction and Reasoning Corpus for AGI v2
Creator: BenchLM

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — average individual human performance is 66%.

Top models on ARC-AGI-2 — May 1, 2026

As of May 1, 2026, GPT-5.5 leads the ARC-AGI-2 leaderboard with 85% , followed by GPT-5.4 Pro (83.3%) and Gemini 3.1 Pro (77.1%).

1Closed

GPT-5.5

OpenAI

85%

Overall 91Context 1M

2Closed

GPT-5.4 Pro

OpenAI

83.3%

Overall 91Context 1.05M

3Closed

Gemini 3.1 Pro

Google

77.1%

Overall 92Context 1M

10 modelsReasoning25% of category scoreCurrentUpdated May 1, 2026

According to BenchLM.ai, GPT-5.5 leads the ARC-AGI-2 benchmark with a score of 85%, followed by GPT-5.4 Pro (83.3%) and Gemini 3.1 Pro (77.1%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.

10 models have been evaluated on ARC-AGI-2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, ARC-AGI-2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.

About ARC-AGI-2

Year

2025

Tasks

Visual pattern completion and abstract reasoning

Format

Grid transformation puzzles with novel rules

Difficulty

Expert-level — hardest public reasoning benchmark

ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. Average individual human performance is 66%, the human panel completion rate is 100%, and the grand prize threshold is greater than 85%. Top frontier models reach 75-85 in BenchLM's tracked data, making it one of the few benchmarks that still separates current reasoning systems.

ARC-AGI-2: A Harder General Intelligence Benchmark

BenchLM freshness & provenance

Version

ARC-AGI 2

Refresh cadence

Static

Staleness state

Current

Question availability

Public benchmark set

Current

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.