
Vals Terminal-Bench 2.0 mirror

Vals AI hosted Terminal-Bench 2.0 view with easy, medium, and hard task splits.

How BenchLM shows Vals Terminal-Bench 2.0 mirror

BenchLM mirrors the public Vals AI Terminal-Bench 2.0 leaderboard, captured from https://www.vals.ai/benchmarks/terminal-bench-2 and last updated by Vals on May 16, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.

Vals Terminal-Bench 2.0 mirror is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

62 Vals rows · 4 task views · Public dataset · Tasks: Overall, Easy, Medium, Hard · Display only

Vals Terminal-Bench score on Vals Terminal-Bench 2.0 mirror — May 16, 2026

BenchLM mirrors the published Vals Terminal-Bench score view for Vals Terminal-Bench 2.0 mirror. GPT-5.5 leads the public snapshot at 73.20%, followed by Claude Opus 4.7 (68.54%) and Gemini 3.5 Flash (67.42%). BenchLM does not use these results to rank models overall.

62 models · External benchmark mirrors · Current · Display only · Updated May 16, 2026

The published Vals Terminal-Bench 2.0 mirror snapshot is tightly clustered at the top: GPT-5.5 sits at 73.20%, while the third row is only 5.79 points behind. The broader top-10 spread is 14.77 points, so the benchmark still separates strong models even when the leaders cluster.
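
As a rough check, the gaps quoted above can be recomputed from the displayed two-decimal scores. The sketch below is illustrative only: the `top10` list and the `gap` helper are names introduced here, not part of BenchLM or Vals. The displayed values give 5.78 points between the first and third rows, 0.01 less than the quoted 5.79, which would be consistent with the published figure being computed from unrounded scores (an assumption, not something the page states).

```python
# Displayed scores (percent) for the top 10 rows of the May 16, 2026 snapshot.
top10 = [73.20, 68.54, 67.42, 67.42, 64.05, 59.55, 59.55, 58.43, 58.43, 58.43]


def gap(scores, i, j):
    """Points between row i and row j (0-indexed), rounded to 2 decimals."""
    return round(scores[i] - scores[j], 2)


leader_to_third = gap(top10, 0, 2)   # first row vs. third row
top10_spread = gap(top10, 0, 9)      # first row vs. tenth row

print(leader_to_third)  # 5.78
print(top10_spread)     # 14.77
```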

62 models have been evaluated on Vals Terminal-Bench 2.0 mirror, which falls in the External benchmark mirrors category. BenchLM tracks this category separately from its weighted global scoring system, so these results are best compared on the dedicated external benchmark mirror views. The benchmark is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
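
To make "excluded from the scoring formula" concrete, here is a minimal sketch of a weighted average that skips display-only benchmarks. It is not BenchLM's actual formula: the `BenchmarkResult` type, the `weighted_overall` function, the weights, and the two placeholder benchmarks are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkResult:
    name: str
    score: float        # percent, 0-100
    weight: float       # hypothetical weight, not a BenchLM value
    display_only: bool  # True for mirrors shown for reference only


def weighted_overall(results: List[BenchmarkResult]) -> Optional[float]:
    """Weighted mean over benchmarks that are not display-only."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return None  # nothing contributes to the ranking
    return sum(r.score * r.weight for r in scored) / total_weight


results = [
    BenchmarkResult("Hypothetical benchmark A", 80.0, 2.0, False),
    BenchmarkResult("Hypothetical benchmark B", 60.0, 1.0, False),
    BenchmarkResult("Vals Terminal-Bench 2.0 mirror", 73.20, 1.0, True),
]
overall = weighted_overall(results)  # the mirror row has no effect here
```

Under these assumptions, changing the mirror's score leaves `overall` unchanged, which is the behavior the paragraph above describes.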

About Vals Terminal-Bench 2.0 mirror

Year: 2026
Tasks: Terminal task difficulty splits
Format: Accuracy score
Difficulty: Terminal-based agent execution

BenchLM mirrors this Vals-hosted Terminal-Bench view as display-only secondary context.

BenchLM freshness & provenance

Version: Vals Terminal-Bench 2.0 mirror 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
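
The page does not spell out the policy, so the following is only a sketch of how freshness metadata could map to the three treatments named above. Every rule in it is an assumption made for illustration: the `classify` function, the `CADENCE_DAYS` mapping, and the age thresholds are hypothetical, and the real policy is the one on the BenchLM methodology page.

```python
from datetime import date

# Hypothetical mapping from refresh cadence to an expected refresh interval.
CADENCE_DAYS = {"Quarterly": 92}


def classify(updated: date, today: date, cadence: str, mirror: bool) -> str:
    """Illustrative freshness rule; thresholds are invented for this sketch."""
    if mirror:
        # External mirrors are shown for reference regardless of age.
        return "display-only reference"
    age_days = (today - updated).days
    limit = CADENCE_DAYS[cadence]
    if age_days <= limit:
        return "strong differentiator"
    if age_days <= 2 * limit:
        return "benchmark to watch"
    return "display-only reference"


# A mirrored benchmark updated on May 16, 2026, checked on June 1, 2026.
status = classify(date(2026, 5, 16), date(2026, 6, 1), "Quarterly", mirror=True)
print(status)  # display-only reference
```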

Vals Terminal-Bench score table (62 models)

Rank | Model | Model ID | Score
1 | GPT-5.5 | openai/gpt-5.5 | 73.20%
2 | Claude Opus 4.7 | anthropic/claude-opus-4-7 | 68.54%
3 | Gemini 3.5 Flash | google/gemini-3.5-flash | 67.42%
4 | Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | 67.42%
5 | GPT-5.3 Codex | openai/gpt-5.3-codex | 64.05%
6 | Muse Spark | meta/muse_spark | 59.55%
7 | Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | 59.55%
8 | Claude Opus 4.5 | anthropic/claude-opus-4-5-20251101 | 58.43%
9 | Claude Opus 4.6 Thinking | anthropic/claude-opus-4-6-thinking | 58.43%
10 | GPT-5.4 | openai/gpt-5.4-2026-03-05 | 58.43%
11 | Kimi K2.6 Thinking | kimi/kimi-k2.6-thinking | 57.30%
12 | DeepSeek V4 Pro | deepseek/deepseek-v4-pro | 56.18%
13 | Gemini 3 Pro Preview | google/gemini-3-pro-preview | 55.06%
14 | Claude Opus 4.5 20251101 Thinking | anthropic/claude-opus-4-5-20251101-thinking | 53.93%
15 | GLM 5.1 Thinking | zai/glm-5.1-thinking | 53.93%
16 | Gemini 3 Flash Preview | google/gemini-3-flash-preview | 51.69%
17 | GPT-5.2 | openai/gpt-5.2-2025-12-11 | 51.69%
18 | Qwen3.6 Max Preview | alibaba/qwen3.6-max-preview | 51.69%
19 | GLM 5 Thinking | zai/glm-5-thinking | 49.44%
20 | MiniMax M2.7 | minimax/MiniMax-M2.7 | 47.19%
21 | Qwen3.6 Plus | alibaba/qwen3.6-plus | 44.94%
22 | GPT-5.1 | openai/gpt-5.1-2025-11-13 | 44.94%
23 | Qwen3.6 27b | alibaba/qwen3.6-27b | 44.94%
24 | GPT-5.4 Mini | openai/gpt-5.4-mini-2026-03-17 | 44.94%
25 | Grok 4.3 | grok/grok-4.3 | 43.45%
26 | Claude Sonnet 4.5 20250929 Thinking | anthropic/claude-sonnet-4-5-20250929-thinking | 41.57%
27 | MiniMax M2.5 Lightning | minimax/MiniMax-M2.5-Lightning | 41.57%
28 | Qwen3.5 Plus Thinking | alibaba/qwen3.5-plus-thinking | 41.57%
29 | Grok 4.20 0309 Reasoning | grok/grok-4.20-0309-reasoning | 40.45%
30 | Kimi K2.5 Thinking | kimi/kimi-k2.5-thinking | 40.45%
31 | GPT-5.4 Nano | openai/gpt-5.4-nano-2026-03-17 | 39.89%
32 | Gemma 4 31b It | google/gemma-4-31b-it | 39.33%
33 | Claude Haiku 4.5 20251001 Thinking | anthropic/claude-haiku-4-5-20251001-thinking | 38.20%
34 | GLM 4.7 | zai/glm-4.7 | 38.20%
35 | MiniMax M2.1 | minimax/MiniMax-M2.1 | 37.08%
36 | GPT-5 | openai/gpt-5-2025-08-07 | 37.08%
37 | Kimi K2 Thinking | kimi/kimi-k2-thinking | 37.08%
38 | DeepSeek V3p2 Thinking | fireworks/deepseek-v3p2-thinking | 35.95%
39 | DeepSeek V3p2 | fireworks/deepseek-v3p2 | 34.83%
40 | Gemini 2.5 Pro | google/gemini-2.5-pro | 30.34%
41 | Mistral Medium 3.5 | mistralai/mistral-medium-3.5 | 30.34%
42 | Grok 4 Fast Reasoning | grok/grok-4-fast-reasoning | 29.21%
43 | GLM 4.6 | zai/glm-4.6 | 28.09%
44 | Grok 4 0709 | grok/grok-4-0709 | 28.09%
45 | GPT-5 Mini | openai/gpt-5-mini-2025-08-07 | 26.97%
46 | Moonshotai Kimi K2 Instruct | together/moonshotai/Kimi-K2-Instruct | 25.84%
47 | Grok 4.1 Fast Reasoning | grok/grok-4-1-fast-reasoning | 24.72%
48 | Gemini 3.1 Flash Lite Preview | google/gemini-3.1-flash-lite-preview | 24.72%
49 | Qwen3.5 Flash | alibaba/qwen3.5-flash | 24.72%
50 | Qwen3 Max | alibaba/qwen3-max | 24.72%
51 | DeepSeek V3p1 | fireworks/deepseek-v3p1 | 22.47%
52 | Gemini 2.5 Flash Preview 09 2025 Thinking | google/gemini-2.5-flash-preview-09-2025-thinking | 21.35%
53 | Qwen3 Max | alibaba/qwen3-max-2026-01-23 | 20.23%
54 | GPT Oss 120b | fireworks/gpt-oss-120b | 19.10%
55 | Grok 4.1 Fast Non Reasoning | grok/grok-4-1-fast-non-reasoning | 17.98%
56 | Trinity Large Thinking | arcee-ai/trinity-large-thinking | 17.98%
57 | Mistral Small 2603 | mistralai/mistral-small-2603 | 16.85%
58 | GPT-4.1 | openai/gpt-4.1-2025-04-14 | 14.61%
59 | Magistral Medium 2509 | mistralai/magistral-medium-2509 | 13.48%
60 | Mistral Large 2512 | mistralai/mistral-large-2512 | 8.99%
61 | Command A 03 2025 | cohere/command-a-03-2025 | 2.25%
62 | Llama4 Maverick Instruct Basic | fireworks/llama4-maverick-instruct-basic | 2.25%

FAQ

What does Vals Terminal-Bench 2.0 mirror measure?

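It is the Vals AI hosted view of Terminal-Bench 2.0, which reports accuracy scores for terminal-based agent execution tasks, with easy, medium, and hard task splits alongside the overall score.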

Which model leads the published Vals Terminal-Bench 2.0 mirror snapshot?

GPT-5.5 currently leads the published Vals Terminal-Bench 2.0 mirror snapshot with a Vals Terminal-Bench score of 73.20%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Vals Terminal-Bench 2.0 mirror?

62 AI models are included in BenchLM's snapshot of Vals Terminal-Bench 2.0 mirror, based on the public leaderboard captured on May 16, 2026.

Last updated: May 16, 2026 · mirrored from the public benchmark leaderboard
