A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.
According to BenchLM.ai, GPT-5.4 Pro leads the Terminal-Bench 2.0 benchmark with a score of 90, followed by GPT-5.4 (90) and GPT-5.3 Codex (90). The top three models are tied at the same score, suggesting this benchmark is nearing saturation for frontier models.
121 models have been evaluated on Terminal-Bench 2.0. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
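To make the weighting concrete, here is a minimal sketch of how a category weight like the 22% agentic share could feed into an overall score. The category names, the other weights, and the `overall` helper are all illustrative assumptions, not BenchLM.ai's actual formula.

```python
# Hypothetical sketch of weighted category aggregation.
# Only the 22% agentic weight comes from the text above;
# the other categories and weights are invented for illustration.
category_scores = {"agentic": 90.0, "reasoning": 85.0, "coding": 88.0}
category_weights = {"agentic": 0.22, "reasoning": 0.40, "coding": 0.38}

def overall(scores: dict, weights: dict) -> float:
    # Weighted mean over categories; weights are assumed to sum to 1.
    return sum(scores[c] * weights[c] for c in scores)

print(round(overall(category_scores, category_weights), 2))  # 87.24
```

Under this kind of scheme, a large gain on an agentic benchmark moves the overall ranking, but only in proportion to the category's weight.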
Year: 2026
Tasks: Terminal-based software tasks
Format: Interactive CLI agent evaluation
Difficulty: Professional software engineering
Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. It is a strong proxy for how useful a model is inside coding agents and autonomous developer tools.
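The inspect-run-edit-recover loop described above can be sketched as a minimal command-execution harness. This is a hypothetical illustration of what an agent evaluation step might look like, not the actual Terminal-Bench 2.0 harness; `run_step` and its signature are assumptions.

```python
import subprocess

def run_step(command: str, cwd: str = ".", timeout: int = 60) -> tuple[int, str]:
    # Execute one shell command in the working directory and capture
    # its combined output, as an agent harness would between model turns.
    result = subprocess.run(
        command, shell=True, cwd=cwd,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, result.stdout + result.stderr

# A multi-step workflow: the agent inspects state, acts, and reads
# the exit code and output to decide (or recover) on the next step.
code, out = run_step("echo hello")
if code != 0:
    # A real agent would re-plan here instead of giving up.
    code, out = run_step("echo retry")
print(code, out.strip())
```

The key property the benchmark measures is that each step's outcome feeds the next decision, so a model must handle failing commands, not just emit a one-shot patch.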