Agent & Tool-Use Benchmarks
Which AI models handle function calling, MCP tool use, browsing, and multi-step agent workflows best? Verified-ranked results across 24 agentic benchmarks.
Agentic carries 22% weight in BenchLM.ai's overall score — the single biggest category.
This page shows only core agentic benchmark rows that have an exact source record attached. Manually entered rows without a verified source are excluded from the displayed agentic score and table cells.
Top verified model: GPT-5.4 Pro · verified score 89.3 · OpenAI
Top open-weight model: Holo3-35B-A3B · verified score 77.8 · H Company
24 benchmarks · terminal, browsing, tool-use, and computer-use
Benchmark Categories
Core Weighted (3)
These 3 benchmarks determine agentic rankings
Tool Calling & MCP (6)
Function calling, MCP tool use, and structured workflows
Agent Frameworks (6)
OpenClaw-style and end-to-end agent evaluations
Computer & Browser Use (5)
Desktop GUI, mobile, and browser navigation tasks
Specialized (4)
Domain-specific agentic tasks across ML, research, and airline customer-service domains
Models Ranked by Weighted Agentic Score
Core Benchmarks Radar — Top 5 Models
| # | Model | Lab | Weights | Score |
|---|-------|-----|---------|-------|
| 1 | GPT-5.4 Pro | OpenAI | Proprietary | 89.3 |
| 2 | Claude Mythos Preview | Anthropic | Proprietary | 82.4 |
| 3 | Holo3-122B-A10B | H Company | Proprietary | 78.9 |
| 4 | Holo3-35B-A3B | H Company | Open Weight | 77.8 |
| 5 | GPT-5.4 | OpenAI | Proprietary | 77 |
| 6 | Claude Opus 4.7 | Anthropic | Proprietary | 74.9 |
| 7 | Claude Opus 4.6 | Anthropic | Proprietary | 72.6 |
| 8 | GPT-5.3 Codex | OpenAI | Proprietary | 71.5 |
| 9 | GPT-5.4 mini | OpenAI | Proprietary | 65.6 |
| 10 | Claude Sonnet 4.6 | Anthropic | Proprietary | 65.3 |
| 11 | GLM-5.1 | Z.AI | Open Weight | 65.3 |
| 12 | Claude Opus 4.5 | Anthropic | Proprietary | 62.5 |
| 13 | Composer 2 | Cursor | Proprietary | 61.7 |
| 14 | Qwen3.6 Plus | Alibaba | Proprietary | 61.6 |
| 15 | Muse Spark | Meta | Proprietary | 59 |
| 16 | MiniMax M2.7 | MiniMax | Open Weight | 57 |
| 17 | GLM-5 | Z.AI | Open Weight | 56.2 |
| 18 | Qwen3.5 397B | Alibaba | Open Weight | 56.2 |
| 19 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 56.1 |
| 20 | Claude Sonnet 4.5 | Anthropic | Proprietary | 55.3 |
| 21 | GPT-5.2 | OpenAI | Proprietary | 55.2 |
| 22 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 54.6 |
| 23 | Kimi K2.5 | Moonshot AI | Open Weight | 54.6 |
| 24 | Qwen3.5-27B | Alibaba | Open Weight | 51.6 |
| 25 | Qwen3.6-35B-A3B | Alibaba | Open Weight | 51.5 |
| 26 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 50.6 |
| 27 | Grok 4.20 | xAI | Proprietary | 47.1 |
| 28 | GLM-4.7 | Z.AI | Open Weight | 45.3 |
| 29 | GPT-5.4 nano | OpenAI | Proprietary | 42.9 |
Agentic score = weighted average of Terminal-Bench 2.0 (40%), OSWorld-Verified (35%), and BrowseComp (25%), normalized over the weights of whichever of those benchmarks a model has verified scores for. This page intentionally stays on BenchLM's verified ranking lane and includes only exact-source rows. Display-only benchmarks (MCP Atlas, Toolathlon, etc.) are tracked but do not affect rankings.
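A minimal sketch of that normalization, assuming a plain dict of verified core scores; the function and key names below are ours for illustration, not BenchLM's actual code:

```python
# Core benchmark weights used for the verified agentic score.
CORE_WEIGHTS = {
    "terminal_bench_2": 0.40,
    "osworld_verified": 0.35,
    "browsecomp": 0.25,
}

def agentic_score(verified: dict[str, float]) -> float | None:
    """Weighted average over the core benchmarks a model has
    verified scores for, renormalized by the available weights."""
    available = {k: w for k, w in CORE_WEIGHTS.items() if k in verified}
    if not available:
        return None  # no verified core rows -> no agentic score
    total_weight = sum(available.values())
    weighted = sum(verified[k] * w for k, w in available.items())
    return weighted / total_weight

# A model with only two verified core rows is scored over 0.40 + 0.25:
print(agentic_score({"terminal_bench_2": 80.0, "browsecomp": 60.0}))  # ~72.3
```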
Frequently Asked Questions
What are LLM agent benchmarks?
Agent benchmarks test whether AI models can go beyond answering questions and actually complete multi-step tasks: browsing the web, writing and running code in a terminal, calling external APIs via function calling, and operating desktop or mobile interfaces. They measure real-world usefulness for autonomous workflows.
What is function calling and why does it matter?
Function calling (or tool use) lets an LLM invoke external tools, APIs, or databases as part of its response. This is critical for building AI agents that can search the web, query databases, send emails, or control other software. Benchmarks like BFCL v4 and Toolathlon specifically measure how reliably models select the right function and pass correct arguments.
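As a rough illustration of the pattern these benchmarks score, here is a minimal Python sketch of tool dispatch; the schema shape, the `get_weather` tool, and the `handle_tool_call` helper are hypothetical examples, not any specific vendor's API:

```python
import json

# Illustrative tool schema in the JSON-Schema style most
# function-calling APIs accept (names here are hypothetical).
TOOLS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call an API

def handle_tool_call(raw: str) -> str:
    """Dispatch a model-emitted call like
    {"name": "get_weather", "arguments": {"city": "Oslo"}}."""
    call = json.loads(raw)
    if call["name"] == "get_weather":
        return get_weather(**call["arguments"])
    raise ValueError(f"unknown tool: {call['name']}")

# What these benchmarks effectively score: did the model pick the
# right tool name and emit arguments that validate against the schema?
print(handle_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```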
What is MCP (Model Context Protocol)?
MCP is an open standard for connecting LLMs to external tools and data sources. MCP Atlas and MCP-Tasks benchmark how well models work with MCP-backed integrations. Strong MCP performance means a model integrates well into tool-rich agent architectures.
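For orientation, MCP runs over JSON-RPC 2.0: a client discovers a server's tools with `tools/list` and invokes one with `tools/call`. A minimal sketch of those payloads as Python dicts (abridged; the `search_docs` tool and its arguments are hypothetical, not part of the spec):

```python
# MCP is JSON-RPC 2.0 under the hood. A client first discovers
# what a server offers, then invokes a tool by name.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Invoking a discovered tool; "search_docs" and its arguments
# are illustrative examples only.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "rate limits"},
    },
}
```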
Why does agentic carry the most weight in BenchLM scores?
Agentic carries 22% of BenchLM's overall score because the ability to use tools, browse, and complete multi-step tasks is the strongest differentiator between models in production use. A model that scores well on knowledge but cannot reliably call functions or navigate software has limited real-world utility for agent workflows.
Which models are best for building AI agents?
Currently, GPT-5.4 Pro by OpenAI leads BenchLM's verified agentic rankings with a score of 89.3. The best open-weight agent model is Holo3-35B-A3B (77.8). Check the leaderboard above for the full verified ranking.
AI models change fast. We track them for you.
For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.
Free. No spam. Unsubscribe anytime.