Benchmark profile

Agents Last Exam (ALE-Bench)

A benchmark for agentic professional workflows with verifiable success criteria, reporting pass rates and partial scores for model plus agent-harness rows.

How BenchLM shows ALE-Bench

BenchLM mirrors the Agents Last Exam full split from the public leaderboard API. The snapshot reports pass rate, partial average score, cost, token, and duration metadata across 152 professional workflow tasks for model plus agent-harness rows.

ALE-Bench is display-only on BenchLM. Each row combines a base model with an agent harness such as Codex, OpenClaw, Claude Code, Droid, Cursor CLI, or Gemini CLI, so BenchLM keeps the table separate from model-only rankings.
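The row identifiers shown on this page, such as codex:reasoning-xhigh/GPT-5.6-Sol, suggest a harness[:setting]/model pattern. The following is a minimal Python sketch of splitting such an identifier, assuming that pattern holds for every row. The HarnessRow type and parse_row_id function are hypothetical names for illustration, not part of the leaderboard API, and the form of identifiers without a setting (for example codex/gpt-5-5) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarnessRow:
    """One leaderboard row: an agent harness paired with a base model."""
    harness: str            # e.g. "codex"
    setting: Optional[str]  # e.g. "reasoning-xhigh"; None when no setting is given
    model: str              # e.g. "GPT-5.6-Sol"


def parse_row_id(row_id: str) -> HarnessRow:
    """Split an assumed 'harness[:setting]/model' identifier into its parts."""
    left, sep, model = row_id.partition("/")
    if not sep or not left or not model:
        raise ValueError(f"unrecognized row id: {row_id!r}")
    harness, _, setting = left.partition(":")
    return HarnessRow(harness=harness, setting=setting or None, model=model)


row = parse_row_id("codex:reasoning-xhigh/GPT-5.6-Sol")
print(row.harness, row.setting, row.model)
```

Keeping the harness and the model as separate fields makes it easy to group rows by harness without treating a row as a pure base-model score.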

The Agent Showdown analysis adds domain and failure-mode context across 13 top-level domains. It also notes that Claude Code plus Fable 5 may include fallback to Opus 4.8 on refused tasks, so BenchLM preserves the official mixed-system row label instead of treating it as a pure base-model score.

53 harness rows · 152 ALE-V1 tasks · 13 domains · Full split · Official API snapshot · Display only

Agents Last Exam leaderboard · Leaderboard API · Agent Showdown analysis · GitHub repository

Pass rate on ALE-Bench — June 2026 API snapshot

BenchLM mirrors the published pass rate view for ALE-Bench. codex (reasoning-xhigh) / GPT-5.6-Sol is listed first in the public snapshot at 30.6%, level at the displayed precision with codex (reasoning-high) / GPT-5.6-Sol (30.6%). codex (reasoning-max) / GPT-5.6-Sol follows at 29.6%. BenchLM does not use these results to rank models overall.

| Row | Harness | Row ID | Pass rate | Overall |
|---|---|---|---|---|
| codex (reasoning-xhigh) / GPT-5.6-Sol | codex | codex:reasoning-xhigh/GPT-5.6-Sol | 30.6% | n/a |
| codex (reasoning-high) / GPT-5.6-Sol | codex | codex:reasoning-high/GPT-5.6-Sol | 30.6% | n/a |
| codex (reasoning-max) / GPT-5.6-Sol | codex | codex:reasoning-max/GPT-5.6-Sol | 29.6% | n/a |

53 models · External benchmark mirrors · Current · Display only · Updated June 2026 API snapshot

Pass rate table (53 models)

| Row | Harness | Score |
|---|---|---|
| codex (reasoning-xhigh) / GPT-5.6-Sol | codex | 30.6% |
| codex (reasoning-high) / GPT-5.6-Sol | codex | 30.6% |
| codex (reasoning-max) / GPT-5.6-Sol | codex | 29.6% |
| codex (reasoning-xhigh) / GPT-5.6-Luna | codex | 29.6% |
| codex (reasoning-medium) / GPT-5.6-Sol | codex | 29.5% |
| codex (reasoning-max) / GPT-5.6-Luna | codex | 28.3% |
| codex (reasoning-max) / GPT-5.6-Terra | codex | 28.0% |
| codex (reasoning-xhigh) / GPT-5.6-Terra | codex | 27.6% |
| claude_code (thinking-max) / claude-opus-4-8 | claude_code | 27.0% |
| codex (reasoning-xhigh) / gpt-5-5 | codex | 26.6% |
| codex (reasoning-high) / GPT-5.6-Terra | codex | 26.0% |
| claude_code (reasoning-xhigh) / anthropic-claude-fable-5 | claude_code | 25.7% |
| codex / gpt-5-5 | codex | 24.2% |
| codex (reasoning-high) / GPT-5.6-Luna | codex | 23.7% |
| codex (reasoning-low) / GPT-5.6-Sol | codex | 23.6% |
| ale_claw / gpt-5-5 | ale_claw | 23.0% |
| codex (reasoning-medium) / GPT-5.6-Terra | codex | 22.7% |
| claude_code (thinking-xhigh) / claude-opus-4-8 | claude_code | 22.4% |
| claude_code (thinking-high) / claude-opus-4-8 | claude_code | 22.4% |
| claude_code / anthropic-claude-fable-5 | claude_code | 22.0% |
| openclaw / gpt-5-5 | openclaw | 21.1% |
| cursor_cli / gpt-5-5 | cursor_cli | 20.7% |
| openclaw / gpt-5-4 | openclaw | 20.5% |
| cursor_cli (thinking-high) / claude-opus-4-7 | cursor_cli | 20.4% |
| claude_code (max) / glm-5-2 | claude_code | 20.4% |
| codex (reasoning-low) / GPT-5.6-Terra | codex | 20.4% |
| cursor_cli / composer-2-5 | cursor_cli | 20.4% |
| claude_code / ark-0614c | claude_code | 19.1% |
| droid / gpt-5-5 | droid | 19.1% |
| ale_claw / claude-opus-4-7 | ale_claw | 18.4% |
| codex (reasoning-medium) / gpt-5-5 | codex | 18.4% |
| codex (reasoning-medium) / GPT-5.6-Luna | codex | 17.1% |
| codex (reasoning-low) / gpt-5-5 | codex | 17.1% |
| claude_code / claude-opus-4-8 | claude_code | 15.8% |
| gemini_cli / gemini-3-1-pro-preview | gemini_cli | 15.8% |
| openclaw / claude-opus-4-7 | openclaw | 15.1% |
| openclaw / gemini-3-1-pro-preview | openclaw | 14.1% |
| claude_code / claude-opus-4-7 | claude_code | 13.2% |
| droid / claude-opus-4-7 | droid | 12.8% |
| openclaw_cli / ark-0614c | openclaw_cli | 12.5% |
| openclaw / deepseek-v4-pro | openclaw | 12.4% |
| openclaw / qwen-qwen3-7-max | openclaw | 11.8% |
| codex (reasoning-low) / GPT-5.6-Luna | codex | 11.8% |
| ale_claw / gpt-5-4 | ale_claw | 11.8% |
| openclaw / glm-5-1 | openclaw | 11.5% |
| openclaw / kimi-k2-6 | openclaw | 9.2% |
| openclaw / qwen3-6-plus | openclaw | 8.6% |
| openclaw / mimo-v2-5 | openclaw | 8.6% |
| codex / gpt-5-4 | codex | 7.2% |
| grok_cli / grok-4-3 | grok_cli | 6.6% |
| openclaw / minimax-m2-7 | openclaw | 5.9% |
| grok_cli / grok-3 | grok_cli | 4.6% |
| openclaw / grok-4-3 | openclaw | 4.3% |

The published ALE-Bench snapshot lists codex (reasoning-xhigh) / GPT-5.6-Sol first at 30.6%, level with the second row at the displayed precision. The third row is 1.0 points behind at 29.6%. The top ten rows span 4.0 points (30.6% to 26.6%), so the leading results sit in a narrow band.
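The gap figures above can be checked directly from the top ten pass rates in the table. A short Python sketch, using Decimal so the one-decimal percentages subtract exactly (the variable names are illustrative):

```python
from decimal import Decimal

# Top ten pass rates (percent) as listed in the pass rate table, highest first.
top_ten = [
    Decimal(s)
    for s in ("30.6", "30.6", "29.6", "29.6", "29.5",
              "28.3", "28.0", "27.6", "27.0", "26.6")
]

# Gap between the first and third rows, in percentage points.
lead_over_third = top_ten[0] - top_ten[2]

# Spread between the best and worst of the top ten, in percentage points.
top_ten_range = max(top_ten) - min(top_ten)

print(f"lead over third row: {lead_over_third} points")
print(f"top-10 range: {top_ten_range} points")
```

Because the first two rows show the same 30.6%, the gap to the second row is zero at the precision the leaderboard displays.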

BenchLM's snapshot covers 53 model plus agent-harness rows on ALE-Bench. The benchmark falls in the External benchmark mirrors category, which BenchLM keeps separate from the weighted global scoring system, so these results remain source-specific evidence. ALE-Bench is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
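To illustrate what "excluded from the scoring formula" means in practice, here is a hedged Python sketch of a weighted average that skips display-only benchmarks. This is not BenchLM's actual formula, which this page does not describe. The BenchmarkResult type, the weighted_overall function, the benchmark names benchmark_a and benchmark_b, and all weights and scores other than the 30.6% ALE-Bench figure are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    score: float        # percentage, 0 to 100
    weight: float       # hypothetical weight; real weights are not given on this page
    display_only: bool  # True for mirrors shown for reference only


def weighted_overall(results: Sequence[BenchmarkResult]) -> Optional[float]:
    """Weighted mean over scored benchmarks; display-only entries are ignored."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return None  # nothing to score
    return sum(r.score * r.weight for r in scored) / total_weight


# Made-up inputs: only the ALE-Bench pass rate comes from the table on this page.
results = [
    BenchmarkResult("benchmark_a", 80.0, 2.0, display_only=False),
    BenchmarkResult("benchmark_b", 60.0, 1.0, display_only=False),
    BenchmarkResult("ALE-Bench", 30.6, 1.0, display_only=True),
]

overall = weighted_overall(results)
without_ale = weighted_overall(results[:2])
print(overall == without_ale)
```

In this sketch, adding or removing the display-only entry leaves the overall value unchanged, which is the property the paragraph above describes.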

About ALE-Bench

Year: 2026

Tasks: 152 ALE-V1 professional workflow tasks across 13 top-level domains

Format: Pass rate, partial-credit score, cost, token, and duration metadata

Difficulty: Real-world agentic workflows

BenchLM mirrors the public Agents Last Exam full leaderboard API as ALE-Bench and links the June 2026 Agent Showdown analysis for domain, cost, speed, and failure-mode context. Rows combine base models with agent harnesses such as Codex, OpenClaw, Claude Code, Droid, Cursor CLI, and Gemini CLI, so the table remains display-only. The source notes that Claude Code plus Fable 5 may include upstream fallback to Opus 4.8 on refused tasks.

Agents Last Exam · Public benchmark source

BenchLM freshness & provenance

Version: ALE-Bench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does ALE-Bench measure?

ALE-Bench measures agentic professional workflows with verifiable success criteria. It reports pass rates and partial scores for model plus agent-harness rows.

Which model leads the published ALE-Bench snapshot?

codex (reasoning-xhigh) / GPT-5.6-Sol currently leads the published ALE-Bench snapshot with a 30.6% pass rate, and codex (reasoning-high) / GPT-5.6-Sol shows the same 30.6% at the displayed precision. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on ALE-Bench?

BenchLM's mirrored ALE-Bench snapshot includes 53 rows, each pairing a base model with an agent harness, based on the public leaderboard as captured in the June 2026 API snapshot.

Last updated: June 2026 API snapshot · mirrored from the public benchmark leaderboard
