Terminal-Bench Hard

A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.

Benchmark score on Terminal-Bench Hard — June 13, 2026

BenchLM mirrors the published score view for Terminal-Bench Hard. GPT-5.5 leads the public snapshot at 60.6%, followed by Claude Opus 4.8 (58.3%) and GPT-5.4 (57.6%). BenchLM does not use these results to rank models overall.

123 models · Coding · Current · Display only · Updated June 13, 2026

The published Terminal-Bench Hard snapshot is tightly clustered at the top: GPT-5.5 sits at 60.6%, while the third row is only 3.0 points behind. The broader top-10 spread is 12.1 points, so the benchmark still separates strong models even when the leaders cluster.
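
The two gaps quoted above can be checked directly against the published scores. The snippet below is a small sketch of that arithmetic, using the top-10 values from the score table on this page; the helper name `gap` is made up for illustration and is not part of any BenchLM tooling.

```python
# Top-10 scores (percent), copied from the score table on this page.
top10 = [60.6, 58.3, 57.6, 54.5, 53.8, 53.0, 52.3, 51.5, 50.8, 48.5]


def gap(scores, i, j):
    """Point gap between the i-th and j-th ranked scores (1-indexed).

    Rounded to one decimal to match the precision of the published scores
    and to hide floating-point noise from the subtraction.
    """
    return round(scores[i - 1] - scores[j - 1], 1)


lead_to_third = gap(top10, 1, 3)   # 60.6 - 57.6
top10_spread = gap(top10, 1, 10)   # 60.6 - 48.5

print(f"Leader vs. third: {lead_to_third} points")   # Leader vs. third: 3.0 points
print(f"Top-10 spread: {top10_spread} points")       # Top-10 spread: 12.1 points
```

These are differences in percentage points of task success rate, not relative changes.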

123 models have been evaluated on Terminal-Bench Hard. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Terminal-Bench Hard is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Terminal-Bench Hard

Year: 2026
Tasks: Agentic coding and terminal tasks
Format: Task success rate
Difficulty: Professional software engineering

BenchLM stores Terminal-Bench Hard separately from Terminal-Bench 2.0 because OpenRouter and Artificial Analysis publish it as a distinct benchmark card.

BenchLM freshness & provenance

Version: Terminal-Bench Hard 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (123 models)

Rank  Score
1     60.6%
2     58.3%
3     57.6%
4     54.5%
5     53.8%
6     53.0%
7     52.3%
8     51.5%
9     50.8%
10    48.5%
11    47.0%
12    47.0%
13    47.0%
14    46.2%
15    46.2%
16    46.2%
17    45.5%
18    45.5%
19    43.9%
20    43.9%
21    43.9%
22    43.2%
23    43.2%
24    43.2%
25    42.4%
26    42.4%
27    41.7%
28    41.7%
29    40.9%
30    40.9%
31    40.9%
32    40.9%
33    39.4%
34    38.6%
35    37.9%
36    37.9%
37    37.9%
38    37.1%
39    37.1%
40    36.4%
41    36.4%
42    35.6%
43    35.6%
44    35.6%
45    34.8%
46    34.8%
47    34.8%
48    34.8%
49    34.8%
50    34.8%
51    34.8%
52    34.3%
53    34.1%
54    33.3%
55    33.3%
56    32.6%
57    32.6%
58    32.6%
59    32.6%
60    31.8%
61    31.8%
62    31.1%
63    28.8%
64    27.3%
65    26.5%
66    26.5%
67    25.8%
68    25.0%
69    25.0%
70    24.2%
71    24.2%
73    23.5%
74    22.7%
75    22.7%
76    22.7%
77    21.2%
78    20.5%
79    20.5%
80    18.9%
81    18.2%
82    17.4%
83    17.4%
84    17.4%
85    15.9%
86    15.9%
87    15.9%
88    14.4%
89    13.6%
90    13.6%
91    12.9%
92    12.1%
93    12.1%
94    10.6%
95    8.3%
97    8.3%
98    7.6%
99    6.8%
100   6.8%
101   6.8%
102   6.8%
103   6.1%
104   6.1%
105   4.5%
106   4.5%
107   3.8%
108   3.8%
109   3.8%
110   3.8%
111   3.0%
112   2.3%
113   2.3%
114   1.5%
115   1.5%
116   1.5%
117   0.8%
118   0.0%
119   0.0%
120   0.0%
121   0.0%
122   0.0%
123   0.0%

FAQ

What does Terminal-Bench Hard measure?

Terminal-Bench Hard measures agentic coding and terminal use on a harder slice of Terminal-Bench. It is an Artificial Analysis coding metric that BenchLM shows as display only.

Which model scores highest on Terminal-Bench Hard?

GPT-5.5 by OpenAI currently leads with a score of 60.6% on Terminal-Bench Hard.

How many models are evaluated on Terminal-Bench Hard?

BenchLM lists 123 AI models evaluated on Terminal-Bench Hard.

Last updated: June 13, 2026 · BenchLM version Terminal-Bench Hard 2026
