
Agent & Tool-Use Benchmarks

Which AI models handle function calling, MCP tool use, browsing, and multi-step agent workflows best? Verified-ranked results across 24 agentic benchmarks.

Agentic carries a 22% weight in BenchLM.ai's overall score, the single biggest category.

This page shows only core agentic benchmark rows that have an attached exact source record. Manually entered rows without a verified source are excluded from the displayed agentic score and from the table cells.

Best Agentic Model

GPT-5.4 Pro

Verified score: 89.3 · OpenAI

Best Open-Weight Agent

Holo3-35B-A3B

Verified score: 77.8 · H Company

Benchmarks Tracked

24 benchmarks

Terminal, browsing, tool-use, and computer-use

Benchmark Categories

Core Weighted (3)

These 3 benchmarks determine agentic rankings

Tool Calling & MCP (6)

Function calling, MCP tool use, and structured workflows

Agent Frameworks (6)

OpenClaw-style and end-to-end agent evaluations

Computer & Browser Use (5)

Desktop GUI, mobile, and browser navigation tasks

Specialized (4)

Domain-specific agentic tasks across ML, research, and airline domains

Models Ranked by Weighted Agentic Score


Core Benchmarks Radar — Top 5 Models

| # | Model | Provider | License | Score |
|---|-------|----------|---------|-------|
| 1 | GPT-5.4 Pro | OpenAI | Proprietary | 89.3 |
| 2 | Claude Mythos Preview | Anthropic | Proprietary | 82.4 |
| 3 | Holo3-122B-A10B | H Company | Proprietary | 78.9 |
| 4 | Holo3-35B-A3B | H Company | Open Weight | 77.8 |
| 5 | GPT-5.4 | OpenAI | Proprietary | 77.0 |
| 6 | Claude Opus 4.7 | Anthropic | Proprietary | 74.9 |
| 7 | Claude Opus 4.6 | Anthropic | Proprietary | 72.6 |
| 8 | GPT-5.3 Codex | OpenAI | Proprietary | 71.5 |
| 9 | GPT-5.4 mini | OpenAI | Proprietary | 65.6 |
| 10 | Claude Sonnet 4.6 | Anthropic | Proprietary | 65.3 |
| 11 | GLM-5.1 | Z.AI | Open Weight | 65.3 |
| 12 | Claude Opus 4.5 | Anthropic | Proprietary | 62.5 |
| 13 | Composer 2 | Cursor | Proprietary | 61.7 |
| 14 | Qwen3.6 Plus | Alibaba | Proprietary | 61.6 |
| 15 | Muse Spark | Meta | Proprietary | 59.0 |
| 16 | MiniMax M2.7 | MiniMax | Open Weight | 57.0 |
| 17 | GLM-5 | Z.AI | Open Weight | 56.2 |
| 18 | Qwen3.5 397B | Alibaba | Open Weight | 56.2 |
| 19 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 56.1 |
| 20 | Claude Sonnet 4.5 | Anthropic | Proprietary | 55.3 |
| 21 | GPT-5.2 | OpenAI | Proprietary | 55.2 |
| 22 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 54.6 |
| 23 | Kimi K2.5 | Moonshot AI | Open Weight | 54.6 |
| 24 | Qwen3.5-27B | Alibaba | Open Weight | 51.6 |
| 25 | Qwen3.6-35B-A3B | Alibaba | Open Weight | 51.5 |
| 26 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 50.6 |
| 27 | Grok 4.20 | xAI | Proprietary | 47.1 |
| 28 | GLM-4.7 | Z.AI | Open Weight | 45.3 |
| 29 | GPT-5.4 nano | OpenAI | Proprietary | 42.9 |

Agentic score = weighted average of Terminal-Bench 2.0 (40%), OSWorld-Verified (35%), and BrowseComp (25%), normalized by available weights. This page intentionally stays on BenchLM's verified ranking lane and only includes exact-source rows. Display-only benchmarks (MCP Atlas, Toolathlon, etc.) are tracked but do not affect rankings.
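The weighting scheme above can be sketched in a few lines. This is an illustrative reimplementation of the stated formula, not BenchLM's actual code; the benchmark names and weights come from this page, while the function and its behavior for missing scores are assumptions based on the phrase "normalized by available weights".

```python
# Core benchmark weights as stated on this page.
CORE_WEIGHTS = {
    "Terminal-Bench 2.0": 0.40,
    "OSWorld-Verified": 0.35,
    "BrowseComp": 0.25,
}

def agentic_score(scores: dict[str, float]) -> float:
    """Weighted average over the core benchmarks a model actually has,
    renormalized by the sum of the weights that are present."""
    available = {b: w for b, w in CORE_WEIGHTS.items() if b in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("no core benchmark scores available")
    return sum(scores[b] * w for b, w in available.items()) / total_weight
```

For example, a model missing BrowseComp would be scored on Terminal-Bench 2.0 and OSWorld-Verified alone, with their 40/35 weights rescaled to sum to 1.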

Frequently Asked Questions

What are LLM agent benchmarks?

Agent benchmarks test whether AI models can go beyond answering questions and actually complete multi-step tasks: browsing the web, writing and running code in a terminal, calling external APIs via function calling, and operating desktop or mobile interfaces. They measure real-world usefulness for autonomous workflows.
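The multi-step loop these benchmarks exercise can be sketched minimally: the model observes the task state, picks a tool call, the harness executes it, and the cycle repeats until the task is done. Everything below (the `run_agent` driver, the toy policy and tools) is a hypothetical illustration, not any benchmark's actual harness.

```python
from typing import Callable

def run_agent(policy: Callable[[str], tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 10) -> str:
    """Drive the observe -> act -> observe loop until the policy says 'done'."""
    observation = task
    for _ in range(max_steps):
        tool_name, argument = policy(observation)  # model picks a tool + argument
        if tool_name == "done":
            return argument                         # final answer
        observation = tools[tool_name](argument)    # harness executes the tool
    return observation

# Toy example: echo the task through one "shell" tool, then finish.
def toy_policy(obs: str) -> tuple[str, str]:
    if obs.startswith("task:"):
        return ("shell", obs.removeprefix("task:"))
    return ("done", obs)

result = run_agent(toy_policy, {"shell": lambda cmd: f"ran {cmd}"}, "task:ls")
# result == "ran ls"
```

Real harnesses differ mainly in what fills the `tools` dict: a sandboxed terminal, a browser, or a GUI controller.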

What is function calling and why does it matter?

Function calling (or tool use) lets an LLM invoke external tools, APIs, or databases as part of its response. This is critical for building AI agents that can search the web, query databases, send emails, or control other software. Benchmarks like BFCL v4 and Toolathlon specifically measure how reliably models select the right function and pass correct arguments.
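What these benchmarks grade can be sketched as a schema check: given a tool definition, did the model name the right function and supply required arguments of the right type? The schema shape below follows the common JSON-Schema convention for tool definitions; the `call_is_valid` checker is a simplified illustration, not BFCL's actual grader.

```python
# Hypothetical tool definition in the usual JSON-Schema style.
send_email = {
    "name": "send_email",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
        },
        "required": ["to", "subject"],
    },
}

def call_is_valid(call: dict, schema: dict) -> bool:
    """Check the function name, required arguments, and string types."""
    if call.get("name") != schema["name"]:
        return False
    args = call.get("arguments", {})
    props = schema["parameters"]["properties"]
    if any(req not in args for req in schema["parameters"]["required"]):
        return False
    return all(isinstance(args[k], str)
               for k in args if props.get(k, {}).get("type") == "string")

model_output = {"name": "send_email", "arguments": {"to": "a@b.c", "subject": "hi"}}
```

Production graders also check enum values, nested objects, and whether the model should have called any tool at all.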

What is MCP (Model Context Protocol)?

MCP is an open standard for connecting LLMs to external tools and data sources. MCP Atlas and MCP-Tasks benchmark how well models work with MCP-backed integrations. Strong MCP performance means a model integrates well into tool-rich agent architectures.

Why does agentic carry the most weight in BenchLM scores?

Agentic carries 22% of BenchLM's overall score because the ability to use tools, browse, and complete multi-step tasks is the strongest differentiator between models in production use. A model that scores well on knowledge but cannot reliably call functions or navigate software has limited real-world utility for agent workflows.

Which models are best for building AI agents?

Currently, GPT-5.4 Pro by OpenAI leads BenchLM's verified agentic rankings with a score of 89.3. The best open-weight agent model is Holo3-35B-A3B (77.8). Check the leaderboard above for the full verified ranking.

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.