Tau2-Telecom

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Benchmark score on Tau2-Telecom — June 13, 2026

BenchLM mirrors the published score view for Tau2-Telecom. Step 3.7 Flash is listed first in the public snapshot at 98.5%, tied with GLM-5V-Turbo (98.5%) and GLM-5-Turbo (98.5%). BenchLM does not use these results to rank models overall.

123 models · Agentic · Current · Display only · Updated June 13, 2026

The published Tau2-Telecom snapshot is tightly clustered at the top: the first three rows are tied at 98.5%, and the top-10 spread (rank 1 minus rank 10) is only 2.6 points, so the leading scores sit in a narrow band.
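The spread figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the ten scores are copied by hand from the first ten rows of the score table on this page; the `spread` helper and the variable names are illustrative and not part of BenchLM.

```python
# Top-10 scores (percent), copied from ranks 1-10 of the score table.
top10 = [98.5, 98.5, 98.5, 98.2, 97.7, 97.7, 97.7, 96.2, 95.9, 95.9]


def spread(scores):
    """Gap in percentage points between the best and worst score.

    Rounded to one decimal place to avoid floating-point noise.
    """
    return round(max(scores) - min(scores), 1)


top3_gap = spread(top10[:3])   # gap between row 1 and row 3
top10_spread = spread(top10)   # gap between row 1 and row 10

print(f"top-3 gap: {top3_gap} points")
print(f"top-10 spread: {top10_spread} points")
```

Rounding inside the helper matters because subtracting decimal percentages in binary floating point can leave a tiny residue rather than an exact one-decimal value.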

123 models have been evaluated on Tau2-Telecom. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Tau2-Telecom is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Tau2-Telecom

Year: 2026
Tasks: Telecom tool workflows
Format: Domain-specific tool evaluation
Difficulty: Professional workflow

OpenAI reports tau2-bench as a domain-specific tool benchmark for telecom tasks, useful for measuring API-call reliability under constraints.

BenchLM freshness & provenance

Version: τ²-Bench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (123 models)

Rank  Score
1     98.5%
2     98.5%
3     98.5%
4     98.2%
5     97.7%
6     97.7%
7     97.7%
8     96.2%
9     95.9%
10    95.9%
11    95.9%
12    95.9%
13    95.9%
14    95.6%
15    95.6%
16    95.6%
17    95.3%
18    95.3%
19    95.0%
20    95.0%
21    94.7%
22    94.4%
23    94.2%
24    94.2%
25    94.2%
26    94.2%
27    93.9%
28    93.9%
29    93.6%
31    93.0%
32    92.7%
33    92.1%
34    92.1%
35    91.5%
36    91.2%
37    90.1%
38    90.1%
39    89.5%
40    89.2%
41    88.9%
42    88.6%
43    87.1%
44    87.1%
45    86.5%
46    86.3%
47    86.0%
48    86.0%
49    84.8%
50    84.8%
51    84.8%
52    84.8%
53    83.9%
54    83.9%
55    83.3%
56    83.3%
57    83.0%
58    83.0%
59    81.9%
60    80.7%
61    80.7%
62    79.5%
63    78.9%
64    76.9%
65    76.0%
66    75.7%
67    74.9%
68    74.3%
69    74.3%
70    74.0%
71    71.4%
72    65.8%
73    65.8%
74    63.7%
75    62.6%
76    61.1%
77    60.2%
78    59.9%
79    54.1%
80    52.9%
81    52.3%
82    47.1%
83    46.8%
84    46.5%
86    43.6%
87    43.3%
88    41.2%
89    41.2%
90    37.4%
91    36.5%
92    36.3%
93    34.8%
94    34.5%
95    31.9%
96    31.3%
97    30.7%
98    28.7%
99    25.4%
100   25.1%
101   24.6%
102   24.3%
103   22.8%
104   22.8%
105   21.1%
106   20.8%
107   20.8%
108   20.5%
109   19.6%
110   19.0%
111   17.8%
112   17.3%
113   16.1%
114   15.5%
115   14.9%
116   14.6%
117   14.0%
118   13.2%
119   11.4%
120   10.5%
121   8.5%
122   4.1%
123   0.0%

FAQ

What does Tau2-Telecom measure?

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Which model scores highest on Tau2-Telecom?

Step 3.7 Flash by StepFun is currently listed first with a score of 98.5% on Tau2-Telecom, a score it shares with GLM-5V-Turbo and GLM-5-Turbo.

How many models are evaluated on Tau2-Telecom?

BenchLM lists Tau2-Telecom scores for 123 AI models.

Last updated: June 13, 2026 · BenchLM version τ²-Bench 2026
