tau2-bench (telecom)

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Top Models on tau2-bench — March 2026

As of March 2026, GPT-5.4 leads the tau2-bench leaderboard with 98.9%, followed by GPT-5.4 mini (93.4%) and GPT-5.4 nano (92.5%).


According to BenchLM.ai, GPT-5.4 leads the tau2-bench leaderboard with a score of 98.9%, followed by GPT-5.4 mini (93.4%) and GPT-5.4 nano (92.5%). The 24.8-point gap between the top and bottom of the leaderboard makes this benchmark effective at differentiating model capabilities.

4 models have been evaluated on tau2-bench. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. However, tau2-bench itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
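To illustrate what a category weight means in practice, here is a minimal sketch of a weighted overall score. This assumes a simple weighted average over category scores; the category names and all weights other than Agentic's 22% are illustrative, not taken from BenchLM.ai.

```python
# Hypothetical weighted-average scoring sketch. Only the 22% Agentic
# weight comes from the page; other categories/weights are invented
# for illustration.
category_weights = {"Agentic": 0.22, "Reasoning": 0.40, "Coding": 0.38}

def overall_score(category_scores: dict[str, float]) -> float:
    """Combine per-category scores into one number using fixed weights."""
    total = sum(category_weights[c] * category_scores[c] for c in category_scores)
    return round(total, 1)

# Since tau2-bench is reference-only, its result would simply be left
# out of the Agentic category average feeding a formula like this.
print(overall_score({"Agentic": 90.0, "Reasoning": 80.0, "Coding": 85.0}))
```

A reference-only benchmark changes nothing here: excluding it just means the Agentic input to this formula is computed from the other benchmarks in that category.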

About tau2-bench

Year: 2026

Tasks: Telecom tool workflows

Format: Domain-specific tool evaluation

Difficulty: Professional workflow

OpenAI reports tau2-bench as a domain-specific tool benchmark for telecom tasks, useful for measuring API-call reliability under constraints.
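As a rough picture of what "structured tool use" grading can look like, here is a minimal sketch that compares a model's emitted tool call against an expected call. This is not tau2-bench's actual harness; the tool name, argument schema, and grading rule are hypothetical.

```python
# Hypothetical grading step for one tool call: the model must emit
# valid JSON naming the right tool with exactly the right arguments.
# The "suspend_line" tool and its fields are invented for illustration.
import json

def grade_tool_call(model_output: str, expected: dict) -> bool:
    """Return True only if the output parses as JSON and matches the
    expected tool name and argument dict."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("tool") == expected["tool"]
            and call.get("arguments") == expected["arguments"])

expected = {"tool": "suspend_line", "arguments": {"customer_id": "C-1042"}}
ok = grade_tool_call(
    '{"tool": "suspend_line", "arguments": {"customer_id": "C-1042"}}',
    expected,
)
print(ok)  # True
```

A real harness would also check call ordering, constraint compliance, and final task state, but the core idea is the same: exact, machine-checkable comparison of structured calls rather than free-text judging.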


Leaderboard (4 models)

#1 GPT-5.4: 98.9%
#2 GPT-5.4 mini: 93.4%
#3 GPT-5.4 nano: 92.5%
#4 GPT-5 mini: 74.1%

FAQ

What does tau2-bench measure?

tau2-bench is a telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Which model scores highest on tau2-bench?

GPT-5.4 by OpenAI currently leads with a score of 98.9% on tau2-bench.

How many models are evaluated on tau2-bench?

4 AI models have been evaluated on tau2-bench on BenchLM.ai.

Last updated: March 17, 2026
