A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.
As of March 2026, GPT-5.4 leads the Toolathlon leaderboard with 54.6%, followed by GPT-5.2 (46.3%) and GPT-5.4 mini (42.9%).
1. GPT-5.4 (OpenAI): 54.6%
2. GPT-5.2 (OpenAI): 46.3%
3. GPT-5.4 mini (OpenAI): 42.9%
According to BenchLM.ai, GPT-5.4 leads the Toolathlon benchmark with a score of 54.6%, followed by GPT-5.2 (46.3%) and GPT-5.4 mini (42.9%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.
Five models have been evaluated on Toolathlon. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Toolathlon itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
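As a rough illustration of how a category-weighted overall score like this could be assembled: only the 22% Agentic weight comes from the page; the other category names, weights, and scores below are hypothetical placeholders, not BenchLM.ai's actual formula.

```python
# Illustrative sketch of a category-weighted overall score.
# Only the 22% Agentic weight is from the page; "Reasoning" and
# "Coding", their weights, and all scores are invented examples.
CATEGORY_WEIGHTS = {"Agentic": 0.22, "Reasoning": 0.40, "Coding": 0.38}

def overall_score(category_scores: dict) -> float:
    """Weighted average over the categories a model has scores for.

    Reference-only benchmarks (like Toolathlon here) are simply
    left out of `category_scores`, so they never affect the result.
    """
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in category_scores)
    weighted = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
    return weighted / total_weight

# Hypothetical model scored in two of the three categories:
print(round(overall_score({"Agentic": 54.6, "Coding": 70.0}), 1))  # → 64.4
```

Renormalizing by `total_weight` means a missing category redistributes its weight across the rest rather than counting as zero.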
Year: 2026
Tasks: Multi-tool workflows
Format: Interactive tool-calling evaluation
Difficulty: Advanced tool use
Toolathlon is useful for judging whether a model can go beyond chat-style answers and actually complete multi-step workflows with external tools.
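A minimal sketch of what an interactive tool-calling evaluation loop of this kind might look like: the tool names, the scripted stand-in for the model, and the success check below are all invented for illustration; Toolathlon's actual harness and tasks are not described on this page.

```python
# Minimal sketch of an interactive tool-calling eval loop.
# Everything here (tools, task, checker) is hypothetical; it only
# illustrates the select/sequence/complete pattern such benchmarks target.
state = {"files": {}, "emails_sent": []}

def write_file(path, text):
    state["files"][path] = text

def send_email(to, body):
    state["emails_sent"].append((to, body))

TOOLS = {"write_file": write_file, "send_email": send_email}

# A scripted stand-in for the model's tool-call decisions; a real
# harness would get these from the model turn by turn.
scripted_calls = [
    ("write_file", {"path": "report.txt", "text": "Q1 summary"}),
    ("send_email", {"to": "boss@example.com", "body": "Report attached"}),
]

for name, args in scripted_calls:
    TOOLS[name](**args)  # select the named tool and execute it

# The task counts as solved only if the whole multi-step workflow
# left the environment in the expected final state:
success = "report.txt" in state["files"] and len(state["emails_sent"]) == 1
print(success)  # → True
```

Grading the final environment state, rather than the transcript, is what distinguishes this style of evaluation from chat-only benchmarks.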