Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

According to BenchLM.ai, GPT-5.4 Pro leads the Terminal-Bench 2.0 benchmark with a score of 90, with GPT-5.4 and GPT-5.3 Codex tied at the same score. With the top models tied at 90, the benchmark appears to be nearing saturation for frontier models.

121 models have been evaluated on Terminal-Bench 2.0. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
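To make the weighting concrete, here is a minimal sketch of how a category weight feeds an overall score. BenchLM.ai's actual aggregation formula is not public; this assumes a simple weighted average, and every category name and weight below except the 22% agentic weight is hypothetical.

```python
# Illustrative only: assumes a weighted average over category scores.
# Only the 0.22 agentic weight comes from the page; the rest is made up.

def overall_score(category_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-category benchmark scores."""
    total_weight = sum(weights[c] for c in category_scores)
    return sum(category_scores[c] * weights[c]
               for c in category_scores) / total_weight

# Hypothetical categories; "agentic" at 0.22 matches the text.
weights = {"agentic": 0.22, "reasoning": 0.40, "knowledge": 0.38}
scores = {"agentic": 90, "reasoning": 85, "knowledge": 80}

print(round(overall_score(scores, weights), 1))  # → 84.2
```

Under these assumed weights, a 10-point swing on the agentic category moves the overall score by 2.2 points, which is why strong Terminal-Bench 2.0 performance shifts a model's overall ranking.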

About Terminal-Bench 2.0

Year: 2026
Tasks: Terminal-based software tasks
Format: Interactive CLI agent evaluation
Difficulty: Professional software engineering

Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. It is a strong proxy for how useful a model is inside coding agents and autonomous developer tools.
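The multi-step, error-recovery style of workflow these tasks demand can be sketched as a minimal run-inspect-recover loop. This is an illustration only; the file names and commands below are hypothetical and not drawn from Terminal-Bench 2.0 itself.

```python
# A minimal sketch of the run-inspect-recover loop agentic terminal
# tasks require: try a command, read the error, fix the state, retry.
import os
import subprocess
import tempfile

def run(cmd: list[str]) -> tuple[int, str]:
    """Run a command, returning its exit code and combined output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "notes.txt")  # hypothetical task file

# Step 1: the command fails because the file does not exist yet.
code, out = run(["cat", target])

# Step 2: recover by repairing the state the error revealed, then retry.
if code != 0:
    with open(target, "w") as f:
        f.write("hello\n")
    code, out = run(["cat", target])

print(code, out.strip())  # → 0 hello
```

A benchmark task chains many such steps, and the agent only passes if the final repository or filesystem state satisfies the task's checks.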

Leaderboard (121 models)

#1 GPT-5.4 Pro: 90
#2 GPT-5.4: 90
#3 GPT-5.3 Codex: 90
#4 GPT-5.2: 90
#6 GPT-5.2-Codex: 90
#8 GPT-5.2 Pro: 88
#9 GPT-5.3 Instant: 86
#10 GPT-5.2 Instant: 83
#11 GLM-5 (Reasoning): 81
#12 Claude Opus 4.6: 80
#13 Grok 4.1: 79
#14 GPT-5.1: 78
#15 GPT-5 (high): 78
#16 Gemini 3.1 Pro: 77
#18 GPT-5 (medium): 77
#19 o1-preview: 77
#21 Kimi K2.5 (Reasoning): 75
#23 DeepSeek Coder 2.0: 73
#24 Claude Opus 4.5: 71
#26 o3: 71
#27 Claude Sonnet 4.6: 70
#28 Claude Sonnet 4.5: 69
#29 o3-pro: 69
#30 Gemini 3 Pro: 68
#31 GPT-5 mini: 68
#32 o3-mini: 67
#33 GLM-4.7: 67
#34 o1: 66
#35 DeepSeekMath V2: 65
#36 Qwen2.5-1M: 65
#37 GLM-4.7-Flash: 64
#38 MiMo-V2-Flash: 63
#39 GLM-5: 63
#40 Mercury 2: 63
#41 Seed 1.6: 63
#43 Step 3.5 Flash: 62
#44 Gemini 2.5 Pro: 61
#45 GPT-4.1: 61
#46 DeepSeek V3.2: 60
#49 Claude 4.1 Opus: 58
#50 o4-mini (high): 58
#51 Qwen3.5 397B: 58
#52 GPT-4o mini: 58
#53 Grok 4: 57
#54 DeepSeek LLM 2.0: 57
#55 Claude 4 Sonnet: 56
#56 Qwen2.5-72B: 56
#57 Gemini 3 Flash: 56
#60 Claude 3.5 Sonnet: 54
#61 GPT-4.1 mini: 54
#62 Claude Haiku 4.5: 53
#64 Seed-2.0-Lite: 52
#65 Mistral Large 3: 52
#66 Seed 1.6 Flash: 52
#67 Kimi K2.5: 51
#68 MiniMax M2.5: 51
#69 Mistral Large 2: 51
#70 GPT-4o: 49
#71 Aion-2.0: 48
#72 Ministral 3 14B: 48
#76 Gemini 1.5 Pro: 45
#78 Claude 3 Opus: 44
#79 Gemini 2.5 Flash: 44
#80 Phi-4: 44
#81 Seed-2.0-Mini: 43
#82 GPT-4.1 nano: 43
#83 GPT-OSS 120B: 43
#84 GPT-4 Turbo: 42
#85 DeepSeek-R1: 42
#87 DBRX Instruct: 41
#88 Claude 3 Haiku: 40
#89 Mistral 8x7B: 40
#90 o1-pro: 40
#91 Moonshot v1: 39
#92 Z-1: 39
#93 Llama 4 Scout: 39
#95 GPT-5 nano: 38
#97 Nemotron-4 15B: 37
#98 Llama 3 70B: 37
#100 Gemini 1.0 Pro: 36
#101 GPT-OSS 20B: 35
#103 LFM2.5-1.2B-Thinking: 34
#105 Qwen3 235B 2507: 33
#107 Grok 3 [Beta]: 32
#108 Nova Pro: 31
#109 LFM2-24B-A2B: 30
#110 Qwen2.5-VL-32B: 30
#111 MiniMax M1 80k: 30
#112 Gemma 3 27B: 29
#113 DeepSeek V3.1: 29
#114 GLM-4.5: 28
#115 GLM-4.5-Air: 28
#116 Kimi K2: 27
#117 Ministral 3 8B: 26
#118 Mistral 7B v0.3: 24
#119 Mistral 8x7B v0.2: 24
#120 LFM2.5-1.2B-Instruct: 22
#121 Ministral 3 3B: 19

FAQ

What does Terminal-Bench 2.0 measure?

Terminal-Bench 2.0 measures agentic software engineering ability in real terminal environments: models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

Which model scores highest on Terminal-Bench 2.0?

GPT-5.4 Pro by OpenAI currently leads with a score of 90 on Terminal-Bench 2.0.

How many models are evaluated on Terminal-Bench 2.0?

121 AI models have been evaluated on Terminal-Bench 2.0 on BenchLM.ai.

Last updated: March 12, 2026
