
Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

Top models on Terminal-Bench 2.0 — April 10, 2026

As of April 10, 2026, Claude Mythos Preview leads the Terminal-Bench 2.0 leaderboard with a score of 82%, followed by GPT-5.3 Codex (77.3%) and GPT-5.4 (75.1%).

22 models · Agentic category · 28% of category score · Current · Updated April 10, 2026

There is significant spread across the leaderboard, which makes this benchmark effective at differentiating model capabilities.

22 models have been evaluated on Terminal-Bench 2.0. The benchmark belongs to the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system; within that category, Terminal-Bench 2.0 contributes 28% of the category score, so strong performance here directly affects a model's overall ranking.
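The two weights above can be combined into a single effective weight. This is a minimal sketch assuming the weights compose multiplicatively (the page implies but does not state this formula):

```python
# Assumed formula: a benchmark's effective weight on the overall score
# is its category's weight times the benchmark's share of that category.
category_weight = 0.22   # Agentic category weight in the overall score
benchmark_share = 0.28   # Terminal-Bench 2.0's share of the Agentic category

effective_weight = category_weight * benchmark_share
print(f"Effective overall weight: {effective_weight:.2%}")
```

Under that assumption, Terminal-Bench 2.0 alone accounts for roughly 6.2% of a model's overall BenchLM score.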

About Terminal-Bench 2.0

Year: 2026

Tasks: Terminal-based software tasks

Format: Interactive CLI agent evaluation

Difficulty: Professional software engineering

Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. It is a strong proxy for how useful a model is inside coding agents and autonomous developer tools.

BenchLM freshness & provenance

Version: Terminal-Bench 2

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (22 models)

1. Claude Mythos Preview: 82%
2. GPT-5.3 Codex: 77.3%
3. GPT-5.4: 75.1%
4. 65.4%
5. 63.5%
6. 61.6%
7. 60%
8. 59.3%
9. 59.1%
10. 59%
11. 57%
12. 56.2%
13. 52.5%
14. 50.8%
15. 50.8%
16. 50%
17. 49.4%
18. 47.1%
19. 46.3%
20. 41.6%
21. 41%
22. 40.5%
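The claim that the leaderboard shows significant spread can be checked directly from the scores listed above:

```python
# The 22 Terminal-Bench 2.0 scores from the leaderboard, in rank order.
scores = [82.0, 77.3, 75.1, 65.4, 63.5, 61.6, 60.0, 59.3, 59.1, 59.0,
          57.0, 56.2, 52.5, 50.8, 50.8, 50.0, 49.4, 47.1, 46.3, 41.6,
          41.0, 40.5]

spread = max(scores) - min(scores)   # gap between rank 1 and rank 22
mean = sum(scores) / len(scores)
print(f"range: {spread:.1f} points, mean: {mean:.1f}%")
```

A 41.5-point gap between the top and bottom models is what makes the benchmark a useful differentiator; a saturated benchmark would cluster scores within a few points.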

FAQ

What does Terminal-Bench 2.0 measure?

Terminal-Bench 2.0 measures agentic software engineering in real terminal environments: models must inspect files, run commands, edit code, and recover from errors across multi-step workflows.

Which model scores highest on Terminal-Bench 2.0?

Claude Mythos Preview by Anthropic currently leads with a score of 82% on Terminal-Bench 2.0.

How many models are evaluated on Terminal-Bench 2.0?

22 AI models have been evaluated on Terminal-Bench 2.0 on BenchLM.

Last updated: April 10, 2026 · BenchLM version Terminal-Bench 2
