A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.
As of April 10, 2026, Claude Mythos Preview leads the Terminal-Bench 2.0 leaderboard with 82%, followed by GPT-5.3 Codex (77.3%) and GPT-5.4 (75.1%).
Claude Mythos Preview (Anthropic)
GPT-5.3 Codex (OpenAI)
GPT-5.4 (OpenAI)
According to BenchLM.ai, Claude Mythos Preview leads the Terminal-Bench 2.0 benchmark with a score of 82%, followed by GPT-5.3 Codex (77.3%) and GPT-5.4 (75.1%). Scores are widely spread across the leaderboard (the top three alone span 6.9 percentage points), which makes this benchmark effective at differentiating model capabilities.
22 models have been evaluated on Terminal-Bench 2.0. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Within that category, Terminal-Bench 2.0 contributes 28% of the category score, so strong performance here directly affects a model's overall ranking.
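To make the weighting above concrete, here is a minimal sketch of how the two percentages might combine. It assumes the overall score is a plain nested weighted sum (category weight times benchmark share), which the text does not confirm; see the BenchLM methodology page for the actual policy. The function and constant names are hypothetical and chosen only for illustration.

```python
# Weights quoted in the text above.
AGENTIC_CATEGORY_WEIGHT = 0.22  # share of the overall score given to the Agentic category
TERMINAL_BENCH_SHARE = 0.28     # share of the Agentic category score from Terminal-Bench 2.0


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to one benchmark.

    Assumption: scoring is a nested weighted sum, so the two weights multiply.
    """
    return category_weight * benchmark_share


def overall_points(benchmark_score: float, weight: float) -> float:
    """Points (out of 100) that a benchmark score, given in percent, would contribute."""
    return benchmark_score * weight


if __name__ == "__main__":
    w = effective_weight(AGENTIC_CATEGORY_WEIGHT, TERMINAL_BENCH_SHARE)
    print(f"Effective weight of Terminal-Bench 2.0: {w:.2%}")

    # Top three scores as reported on the leaderboard.
    leaderboard = [
        ("Claude Mythos Preview", 82.0),
        ("GPT-5.3 Codex", 77.3),
        ("GPT-5.4", 75.1),
    ]
    for name, score in leaderboard:
        print(f"{name}: {overall_points(score, w):.2f} overall points")
```

Under these assumptions, Terminal-Bench 2.0 would account for about 6.2% of the overall score, so the 4.7-point gap between the top two models would translate to roughly 0.3 points in the overall ranking.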
Year: 2026
Tasks: Terminal-based software tasks
Format: Interactive CLI agent evaluation
Difficulty: Professional software engineering
Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. It is a strong proxy for how useful a model is inside coding agents and autonomous developer tools.
Version: Terminal-Bench 2
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.