Agent & Tool-Use Benchmarks
Which AI models handle function calling, MCP tool use, browsing, and multi-step agent workflows best? Verified-ranked results across 24 agentic benchmarks.
Agentic carries 22% weight in BenchLM.ai's overall score — the single biggest category.
This page shows only core agentic benchmark rows that have an exact source record attached. Manually entered rows without a verified source are excluded from the displayed agentic score and table cells.
Top verified model: GPT-5.4 Pro · verified score 89.3 · OpenAI
Top open-weight model: Holo3-35B-A3B · verified score 77.8 · H Company
24 benchmarks · terminal, browsing, tool-use, and computer-use
Benchmark Categories
Core Weighted (3)
These 3 benchmarks determine agentic rankings
Tool Calling & MCP (6)
Function calling, MCP tool use, and structured workflows
Agent Frameworks (6)
OpenClaw-style and end-to-end agent evaluations
Computer & Browser Use (5)
Desktop GUI, mobile, and browser navigation tasks
Specialized (4)
Domain-specific agentic tasks across ML, research, and airline customer-service domains
Models Ranked by Weighted Agentic Score
Core Benchmarks Radar — Top 5 Models
| # | Model | Lab | Weights | Score |
|---|-------|-----|---------|-------|
| 1 | GPT-5.4 Pro | OpenAI | Proprietary | 89.3 |
| 2 | Claude Mythos Preview | Anthropic | Proprietary | 82.4 |
| 3 | Holo3-122B-A10B | H Company | Proprietary | 78.9 |
| 4 | Holo3-35B-A3B | H Company | Open Weight | 77.8 |
| 5 | GPT-5.4 | OpenAI | Proprietary | 77 |
| 6 | Claude Opus 4.7 | Anthropic | Proprietary | 74.9 |
| 7 | Claude Opus 4.6 | Anthropic | Proprietary | 72.6 |
| 8 | GPT-5.3 Codex | OpenAI | Proprietary | 71.5 |
| 9 | GPT-5.4 mini | OpenAI | Proprietary | 65.6 |
| 10 | Claude Sonnet 4.6 | Anthropic | Proprietary | 65.3 |
| 11 | GLM-5.1 | Z.AI | Open Weight | 65.3 |
| 12 | Claude Opus 4.5 | Anthropic | Proprietary | 62.5 |
| 13 | Composer 2 | Cursor | Proprietary | 61.7 |
| 14 | Qwen3.6 Plus | Alibaba | Proprietary | 61.6 |
| 15 | Muse Spark | Meta | Proprietary | 59 |
| 16 | MiniMax M2.7 | MiniMax | Open Weight | 57 |
| 17 | GLM-5 | Z.AI | Open Weight | 56.2 |
| 18 | Qwen3.5 397B | Alibaba | Open Weight | 56.2 |
| 19 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 56.1 |
| 20 | Claude Sonnet 4.5 | Anthropic | Proprietary | 55.3 |
| 21 | GPT-5.2 | OpenAI | Proprietary | 55.2 |
| 22 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 54.6 |
| 23 | Kimi K2.5 | Moonshot AI | Open Weight | 54.6 |
| 24 | Qwen3.5-27B | Alibaba | Open Weight | 51.6 |
| 25 | Qwen3.6-35B-A3B | Alibaba | Open Weight | 51.5 |
| 26 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 50.6 |
| 27 | Grok 4.20 | xAI | Proprietary | 47.1 |
| 28 | GLM-4.7 | Z.AI | Open Weight | 45.3 |
| 29 | GPT-5.4 nano | OpenAI | Proprietary | 42.9 |
Agentic score = weighted average of Terminal-Bench 2.0 (40%), OSWorld-Verified (35%), and BrowseComp (25%), normalized over the weights of whichever of those benchmarks a model has verified scores for. This page intentionally stays on BenchLM's verified ranking lane and includes only exact-source rows. Display-only benchmarks (MCP Atlas, Toolathlon, etc.) are tracked but do not affect rankings.
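A minimal sketch of that normalization, assuming a plain dict of verified core scores; the function and key names below are ours for illustration, not BenchLM's actual code:

```python
# Core benchmark weights used for the verified agentic score.
CORE_WEIGHTS = {
    "terminal_bench_2": 0.40,
    "osworld_verified": 0.35,
    "browsecomp": 0.25,
}

def agentic_score(verified: dict[str, float]) -> float | None:
    """Weighted average over the core benchmarks a model has
    verified scores for, renormalized by the available weights."""
    available = {k: w for k, w in CORE_WEIGHTS.items() if k in verified}
    if not available:
        return None  # no verified core rows -> no agentic score
    total_weight = sum(available.values())
    weighted = sum(verified[k] * w for k, w in available.items())
    return weighted / total_weight

# A model with only two verified core rows is scored over 0.40 + 0.25:
print(agentic_score({"terminal_bench_2": 80.0, "browsecomp": 60.0}))  # ~72.3
```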
Frequently Asked Questions
What are LLM agent benchmarks?
Agent benchmarks test whether AI models can go beyond answering questions and actually complete multi-step tasks: browsing the web, writing and running code in a terminal, calling external APIs via function calling, and operating desktop or mobile interfaces. They measure real-world usefulness for autonomous workflows.
What is function calling and why does it matter?
Function calling (or tool use) lets an LLM invoke external tools, APIs, or databases as part of its response. This is critical for building AI agents that can search the web, query databases, send emails, or control other software. Benchmarks like BFCL v4 and Toolathlon specifically measure how reliably models select the right function and pass correct arguments.
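As a rough illustration of the pattern these benchmarks score, here is a minimal Python sketch of tool dispatch; the schema shape, the `get_weather` tool, and the `handle_tool_call` helper are hypothetical examples, not any specific vendor's API:

```python
import json

# Illustrative tool schema in the JSON-Schema style most
# function-calling APIs accept (names here are hypothetical).
TOOLS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call an API

def handle_tool_call(raw: str) -> str:
    """Dispatch a model-emitted call like
    {"name": "get_weather", "arguments": {"city": "Oslo"}}."""
    call = json.loads(raw)
    if call["name"] == "get_weather":
        return get_weather(**call["arguments"])
    raise ValueError(f"unknown tool: {call['name']}")

# What these benchmarks effectively score: did the model pick the
# right tool name and emit arguments that validate against the schema?
print(handle_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```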
What is MCP (Model Context Protocol)?
MCP is an open standard for connecting LLMs to external tools and data sources. MCP Atlas and MCP-Tasks benchmark how well models work with MCP-backed integrations. Strong MCP performance means a model integrates well into tool-rich agent architectures.
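For orientation, MCP runs over JSON-RPC 2.0: a client discovers a server's tools with `tools/list` and invokes one with `tools/call`. A minimal sketch of those payloads as Python dicts (abridged; the `search_docs` tool and its arguments are hypothetical, not part of the spec):

```python
# MCP is JSON-RPC 2.0 under the hood. A client first discovers
# what a server offers, then invokes a tool by name.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Invoking a discovered tool; "search_docs" and its arguments
# are illustrative examples only.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "rate limits"},
    },
}
```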
Why does agentic carry the most weight in BenchLM scores?
Agentic carries 22% of BenchLM's overall score because the ability to use tools, browse, and complete multi-step tasks is the strongest differentiator between models in production use. A model that scores well on knowledge but cannot reliably call functions or navigate software has limited real-world utility for agent workflows.
Which models are best for building AI agents?
Currently, GPT-5.4 Pro by OpenAI leads BenchLM's verified agentic rankings with a score of 89.3. The best open-weight agent model is Holo3-35B-A3B (77.8). Check the leaderboard above for the full verified ranking.
AI models change fast. We track them for you.
For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.
Free. No spam. Unsubscribe anytime.