Agent & Tool-Use Benchmarks

Which AI models handle function calling, MCP tool use, browsing, and multi-step agent workflows best? Rankings across 24 agentic benchmarks.

Agentic benchmarks carry 22% of BenchLM.ai's overall score, the single largest category.

Best Agentic Model

GPT-5.4 Pro

Score: 87.7 · OpenAI

Best Open-Weight Agent

GLM-5 (Reasoning)

Score: 78.3 · Zhipu AI

Benchmarks Tracked

24 benchmarks

Terminal, browsing, tool-use, and computer-use

Benchmark Categories

Core Weighted (3)

These 3 benchmarks determine agentic rankings

Tool Calling & MCP (6)

Function calling, MCP tool use, and structured workflows

Agent Frameworks (6)

OpenClaw-style and end-to-end agent evaluations

Computer & Browser Use (5)

Desktop GUI, mobile, and browser navigation tasks

Specialized (4)

Domain-specific agentic tasks spanning ML, research, and airline domains

Top 15 Models by Weighted Agentic Score (chart)


Core Benchmarks Radar — Top 5 Models (chart)

# | Model | Creator | License | Score
1 | GPT-5.4 Pro | OpenAI | Proprietary | 87.7
2 | GPT-5.2-Codex | OpenAI | Proprietary | 87.0
3 | MiMo-V2-Pro | Xiaomi | Proprietary | 86.7
4 | GPT-5.1-Codex-Max | OpenAI | Proprietary | 86.0
5 | Holo3-122B-A10B | H Company | Proprietary | 78.9
6 | GLM-5 (Reasoning) | Zhipu AI | Open Weight | 78.3
7 | Grok 4.1 | xAI | Proprietary | 78.2
8 | Gemini 3 Pro Deep Think | Google | Proprietary | 78.1
9 | Holo3-35B-A3B | H Company | Open Weight | 77.8
10 | GPT-5.4 | OpenAI | Proprietary | 77.0
11 | Gemini 3.1 Pro | Google | Proprietary | 76.1
12 | GPT-5.1 | OpenAI | Proprietary | 75.8
13 | GPT-5 (medium) | OpenAI | Proprietary | 75.5
14 | o1-preview | OpenAI | Proprietary | 75.4
15 | GPT-5 (high) | OpenAI | Proprietary | 75.2
16 | GPT-5.3 Codex | OpenAI | Proprietary | 74.4
17 | Claude Opus 4.6 | Anthropic | Proprietary | 72.6
18 | Claude Sonnet 4.6 | Anthropic | Proprietary | 72.5
19 | Gemini 3 Pro | Google | Proprietary | 71.1
20 | Grok 4.1 Fast | xAI | Proprietary | 71.0
21 | o3-pro | OpenAI | Proprietary | 70.4
22 | Qwen3.5 397B (Reasoning) | Alibaba | Open Weight | 70.0
23 | o3 | OpenAI | Proprietary | 69.9
24 | DeepSeek V3.2 (Thinking) | DeepSeek | Open Weight | 69.4
25 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | 67.5
26 | o3-mini | OpenAI | Proprietary | 66.6
27 | GPT-5.2 | OpenAI | Proprietary | 66.2
28 | GPT-5.4 mini | OpenAI | Proprietary | 65.6
29 | o1 | OpenAI | Proprietary | 65.4
30 | Claude Opus 4.5 | Anthropic | Proprietary | 65.2
31 | GPT-4.1 | OpenAI | Proprietary | 64.7
32 | Qwen2.5-1M | Alibaba | Open Weight | 64.7
33 | DeepSeekMath V2 | DeepSeek | Open Weight | 63.9
34 | Nemotron 3 Ultra 500B | NVIDIA | Open Weight | 62.8
35 | Qwen3.6 Plus | Alibaba | Proprietary | 62.0
36 | MiMo-V2-Flash | Xiaomi | Open Weight | 61.8
37 | Gemini 2.5 Pro | Google | Proprietary | 61.7
38 | Composer 2 | Cursor | Proprietary | 61.7
39 | GLM-4.7 | Zhipu AI | Open Weight | 61.0
40 | Claude Sonnet 4.5 | Anthropic | Proprietary | 60.0
41 | DeepSeek V3.2 | DeepSeek | Open Weight | 58.8
42 | o4-mini (high) | OpenAI | Proprietary | 58.5
43 | GLM-5 | Zhipu AI | Open Weight | 58.3
44 | Qwen3.5 397B | Alibaba | Open Weight | 58.3
45 | Grok 4 | xAI | Proprietary | 58.1
46 | GLM-5V-Turbo | Zhipu AI | Proprietary | 58.0
47 | Claude 4 Sonnet | Anthropic | Proprietary | 57.9
48 | Qwen2.5-72B | Alibaba | Open Weight | 57.7
49 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 57.6
50 | Kimi K2.5 | Moonshot AI | Open Weight | 57.6
51 | Gemini 3 Flash | Google | Proprietary | 57.5
52 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | 57.0
53 | MiniMax M2.7 | MiniMax | Proprietary | 57.0
54 | Nemotron 3 Super 100B | NVIDIA | Open Weight | 56.6
55 | GPT-4.1 mini | OpenAI | Proprietary | 56.5
56 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 56.0
57 | Grok Code Fast 1 | xAI | Proprietary | 55.7
58 | Claude 3.5 Sonnet | Anthropic | Proprietary | 55.0
59 | Claude 4.1 Opus Thinking | Anthropic | Proprietary | 54.0
60 | Llama 3.1 405B | Meta | Open Weight | 53.0
61 | Claude 4.1 Opus | Anthropic | Proprietary | 52.8
62 | Gemini 1.5 Pro | Google | Proprietary | 52.3
63 | Mistral Large 2 | Mistral | Proprietary | 52.2
64 | Kimi K2 | Moonshot AI | Proprietary | 52.1
65 | Claude Haiku 4.5 | Anthropic | Proprietary | 51.9
66 | Qwen3.5-27B | Alibaba | Open Weight | 51.6
67 | GPT-4o mini | OpenAI | Proprietary | 50.9
68 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 50.5
69 | Sarvam 105B | Sarvam | Open Weight | 49.5
70 | Gemini 3.1 Flash-Lite | Google | Proprietary | 49.2
71 | Mistral Large 3 | Mistral | Proprietary | 49.0
72 | GPT-4o | OpenAI | Proprietary | 48.5
73 | Claude 3 Opus | Anthropic | Proprietary | 48.1
74 | Qwen3 235B 2507 (Reasoning) | Alibaba | Open Weight | 47.4
75 | GPT-4.1 nano | OpenAI | Proprietary | 47.4
76 | Nemotron Ultra 253B | NVIDIA | Open Weight | 46.7
77 | Gemini 2.5 Flash | Google | Proprietary | 46.5
78 | GPT-OSS 120B | OpenAI | Open Weight | 44.8
79 | GPT-4 Turbo | OpenAI | Proprietary | 44.7
80 | DeepSeek-R1 | DeepSeek | Open Weight | 44.5
81 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open Weight | 44.3
82 | Claude 3 Haiku | Anthropic | Proprietary | 44.0
83 | GPT-5.4 nano | OpenAI | Proprietary | 42.9
84 | Z-1 | Z | Proprietary | 42.2
85 | Moonshot v1 | Moonshot AI | Proprietary | 42.2
86 | Nemotron-4 15B | NVIDIA | Open Weight | 41.3
87 | Llama 3 70B | Meta | Open Weight | 41.2
88 | Mistral 8x7B | Mistral | Open Weight | 41.1
89 | Llama 4 Maverick | Meta | Open Weight | 40.9
90 | Gemini 1.0 Pro | Google | Proprietary | 39.8
91 | o1-pro | OpenAI | Proprietary | 39.7
92 | Nemotron 3 Nano 30B | NVIDIA | Open Weight | 39.6
93 | Llama 4 Scout | Meta | Open Weight | 39.0
94 | Phi-4 | Microsoft | Open Weight | 38.3
95 | Grok 3 [Beta] | xAI | Proprietary | 35.5
96 | Sarvam 30B | Sarvam | Open Weight | 35.5
97 | GPT-OSS 20B | OpenAI | Open Weight | 35.4
98 | Nova Pro | Amazon | Proprietary | 34.9
99 | Gemma 3 27B | Google | Open Weight | 34.4
100 | DBRX Instruct | Databricks | Open Weight | 34.3
101 | Qwen3 235B 2507 | Alibaba | Open Weight | 33.7
102 | Llama 4 Behemoth | Meta | Open Weight | 33.0
103 | DeepSeek V3.1 | DeepSeek | Open Weight | 32.9
104 | Mixtral 8x22B Instruct v0.1 | Mistral | Open Weight | 31.8
105 | GLM-4.5-Air | Zhipu AI | Proprietary | 31.5
106 | GLM-4.5 | Zhipu AI | Proprietary | 28.0
107 | Mistral 8x7B v0.2 | Mistral | Open Weight | 27.8
108 | Mistral 7B v0.3 | Mistral | Open Weight | 26.4

Agentic score = weighted average of Terminal-Bench 2.0 (40%), OSWorld-Verified (35%), and BrowseComp (25%), normalized by available weights. Display-only benchmarks (MCP Atlas, Toolathlon, etc.) are tracked but do not affect rankings.

Frequently Asked Questions

What are LLM agent benchmarks?

Agent benchmarks test whether AI models can go beyond answering questions and actually complete multi-step tasks: browsing the web, writing and running code in a terminal, calling external APIs via function calling, and operating desktop or mobile interfaces. They measure real-world usefulness for autonomous workflows.
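The multi-step loop these benchmarks exercise can be sketched as follows. `model`, `env`, the action format, and the step budget are hypothetical stand-ins, not any specific benchmark harness:

```python
# Skeleton of an agent-benchmark episode: the model repeatedly observes the
# environment, chooses an action (run a shell command, click, call a tool),
# and acts until it declares the task finished or the step budget runs out.
def run_episode(model, env, max_steps=30):
    observation = env.reset()
    for _ in range(max_steps):
        action = model.act(observation)   # e.g. {"tool": "shell", "args": {...}}
        if action["tool"] == "finish":
            break                         # the agent claims the task is done
        observation = env.step(action)    # execute the action, get the new state
    return env.score()                    # task-specific success metric
```

What separates agent benchmarks from static Q&A is exactly this loop: a single bad action early on compounds across later steps, so scores reward sustained reliability, not one good answer.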

What is function calling and why does it matter?

Function calling (or tool use) lets an LLM invoke external tools, APIs, or databases as part of its response. This is critical for building AI agents that can search the web, query databases, send emails, or control other software. Benchmarks like BFCL v4 and Toolathlon specifically measure how reliably models select the right function and pass correct arguments.
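As a rough illustration of what these benchmarks grade, here is a minimal tool schema plus dispatcher in the JSON-Schema style common to function-calling APIs. The `get_weather` tool and the `dispatch` helper are invented for this example; real APIs differ in detail:

```python
import json

# A tool definition in the common JSON-Schema style: name, description, and
# a typed parameter object with required fields.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(call_json: str, implementations: dict) -> str:
    """Validate a model-emitted tool call and run the matching function."""
    call = json.loads(call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    missing = [p for p in TOOLS[name]["parameters"]["required"] if p not in args]
    if missing:
        return f"error: missing arguments {missing}"
    return implementations[name](**args)

result = dispatch(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    {"get_weather": lambda city: f"Sunny in {city}"},
)
```

Function-calling benchmarks essentially score the two failure branches above: did the model pick a tool that exists, and did it supply well-formed arguments that satisfy the schema.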

What is MCP (Model Context Protocol)?

MCP is an open standard for connecting LLMs to external tools and data sources. MCP Atlas and MCP-Tasks benchmark how well models work with MCP-backed integrations. Strong MCP performance means a model integrates well into tool-rich agent architectures.
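Concretely, MCP messages are JSON-RPC 2.0: a client invokes a server-exposed tool with a `tools/call` request naming the tool and its arguments. A minimal sketch of building such a request (the `search_docs` tool name is made up; real clients also handle initialization, `tools/list` discovery, and responses):

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 "tools/call" request, as used by MCP clients."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

msg = mcp_tool_call(1, "search_docs", {"query": "refund policy"})
```

Because every MCP server speaks this same wire format, a model that reliably produces well-formed tool names and arguments can be pointed at new integrations without retraining, which is what MCP-focused benchmarks try to measure.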

Why does agentic carry the most weight in BenchLM scores?

Agentic carries 22% of BenchLM's overall score because the ability to use tools, browse, and complete multi-step tasks is the strongest differentiator between models in production use. A model that scores well on knowledge but cannot reliably call functions or navigate software has limited real-world utility for agent workflows.

Which models are best for building AI agents?

Currently, GPT-5.4 Pro by OpenAI leads the weighted agentic rankings with a score of 87.7. The best open-weight agent model is GLM-5 (Reasoning) at 78.3. See the leaderboard above for the full ranking.

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.