Which AI models handle function calling, MCP tool use, browsing, and multi-step agent workflows best? Rankings across 24 agentic benchmarks.
Agentic carries a 22% weight in BenchLM.ai's overall score, the single biggest category.
Top model: GPT-5.4 Pro · Score 87.7 · OpenAI
Top open-weight model: GLM-5 (Reasoning) · Score 78.3 · Zhipu AI
24 benchmarks tracked: terminal, browsing, tool-use, and computer-use
Three benchmarks (Terminal-Bench 2.0, OSWorld-Verified, and BrowseComp) determine the agentic rankings; the full set of 24 tracked benchmarks spans:
- Function calling, MCP tool use, and structured workflows
- OpenClaw-style and end-to-end agent evaluations
- Desktop GUI, mobile, and browser navigation tasks
- Domain-specific agentic tasks across ML, research, and airline domains
| # | Model (Developer · License) | Score |
|---|---|---|
| 1 | GPT-5.4 Pro OpenAI · Proprietary | 87.7 |
| 2 | GPT-5.2-Codex OpenAI · Proprietary | 87 |
| 3 | MiMo-V2-Pro Xiaomi · Proprietary | 86.7 |
| 4 | GPT-5.1-Codex-Max OpenAI · Proprietary | 86 |
| 5 | Holo3-122B-A10B H Company · Proprietary | 78.9 |
| 6 | GLM-5 (Reasoning) Zhipu AI · Open Weight | 78.3 |
| 7 | Grok 4.1 xAI · Proprietary | 78.2 |
| 8 | Gemini 3 Pro Deep Think Google · Proprietary | 78.1 |
| 9 | Holo3-35B-A3B H Company · Open Weight | 77.8 |
| 10 | GPT-5.4 OpenAI · Proprietary | 77 |
| 11 | Gemini 3.1 Pro Google · Proprietary | 76.1 |
| 12 | GPT-5.1 OpenAI · Proprietary | 75.8 |
| 13 | GPT-5 (medium) OpenAI · Proprietary | 75.5 |
| 14 | o1-preview OpenAI · Proprietary | 75.4 |
| 15 | GPT-5 (high) OpenAI · Proprietary | 75.2 |
| 16 | GPT-5.3 Codex OpenAI · Proprietary | 74.4 |
| 17 | Claude Opus 4.6 Anthropic · Proprietary | 72.6 |
| 18 | Claude Sonnet 4.6 Anthropic · Proprietary | 72.5 |
| 19 | Gemini 3 Pro Google · Proprietary | 71.1 |
| 20 | Grok 4.1 Fast xAI · Proprietary | 71 |
| 21 | o3-pro OpenAI · Proprietary | 70.4 |
| 22 | Qwen3.5 397B (Reasoning) Alibaba · Open Weight | 70 |
| 23 | o3 OpenAI · Proprietary | 69.9 |
| 24 | DeepSeek V3.2 (Thinking) DeepSeek · Open Weight | 69.4 |
| 25 | DeepSeek Coder 2.0 DeepSeek · Open Weight | 67.5 |
| 26 | o3-mini OpenAI · Proprietary | 66.6 |
| 27 | GPT-5.2 OpenAI · Proprietary | 66.2 |
| 28 | GPT-5.4 mini OpenAI · Proprietary | 65.6 |
| 29 | o1 OpenAI · Proprietary | 65.4 |
| 30 | Claude Opus 4.5 Anthropic · Proprietary | 65.2 |
| 31 | GPT-4.1 OpenAI · Proprietary | 64.7 |
| 32 | Qwen2.5-1M Alibaba · Open Weight | 64.7 |
| 33 | DeepSeekMath V2 DeepSeek · Open Weight | 63.9 |
| 34 | Nemotron 3 Ultra 500B NVIDIA · Open Weight | 62.8 |
| 35 | Qwen3.6 Plus Alibaba · Proprietary | 62 |
| 36 | MiMo-V2-Flash Xiaomi · Open Weight | 61.8 |
| 37 | Gemini 2.5 Pro Google · Proprietary | 61.7 |
| 38 | Composer 2 Cursor · Proprietary | 61.7 |
| 39 | GLM-4.7 Zhipu AI · Open Weight | 61 |
| 40 | Claude Sonnet 4.5 Anthropic · Proprietary | 60 |
| 41 | DeepSeek V3.2 DeepSeek · Open Weight | 58.8 |
| 42 | o4-mini (high) OpenAI · Proprietary | 58.5 |
| 43 | GLM-5 Zhipu AI · Open Weight | 58.3 |
| 44 | Qwen3.5 397B Alibaba · Open Weight | 58.3 |
| 45 | Grok 4 xAI · Proprietary | 58.1 |
| 46 | GLM-5V-Turbo Zhipu AI · Proprietary | 58 |
| 47 | Claude 4 Sonnet Anthropic · Proprietary | 57.9 |
| 48 | Qwen2.5-72B Alibaba · Open Weight | 57.7 |
| 49 | Kimi K2.5 (Reasoning) Moonshot AI · Proprietary | 57.6 |
| 50 | Kimi K2.5 Moonshot AI · Open Weight | 57.6 |
| 51 | Gemini 3 Flash Google · Proprietary | 57.5 |
| 52 | DeepSeek LLM 2.0 DeepSeek · Open Weight | 57 |
| 53 | MiniMax M2.7 MiniMax · Proprietary | 57 |
| 54 | Nemotron 3 Super 100B NVIDIA · Open Weight | 56.6 |
| 55 | GPT-4.1 mini OpenAI · Proprietary | 56.5 |
| 56 | Qwen3.5-122B-A10B Alibaba · Open Weight | 56 |
| 57 | Grok Code Fast 1 xAI · Proprietary | 55.7 |
| 58 | Claude 3.5 Sonnet Anthropic · Proprietary | 55 |
| 59 | Claude 4.1 Opus Thinking Anthropic · Proprietary | 54 |
| 60 | Llama 3.1 405B Meta · Open Weight | 53 |
| 61 | Claude 4.1 Opus Anthropic · Proprietary | 52.8 |
| 62 | Gemini 1.5 Pro Google · Proprietary | 52.3 |
| 63 | Mistral Large 2 Mistral · Proprietary | 52.2 |
| 64 | Kimi K2 Moonshot AI · Proprietary | 52.1 |
| 65 | Claude Haiku 4.5 Anthropic · Proprietary | 51.9 |
| 66 | Qwen3.5-27B Alibaba · Open Weight | 51.6 |
| 67 | GPT-4o mini OpenAI · Proprietary | 50.9 |
| 68 | Qwen3.5-35B-A3B Alibaba · Open Weight | 50.5 |
| 69 | Sarvam 105B Sarvam · Open Weight | 49.5 |
| 70 | Gemini 3.1 Flash-Lite Google · Proprietary | 49.2 |
| 71 | Mistral Large 3 Mistral · Proprietary | 49 |
| 72 | GPT-4o OpenAI · Proprietary | 48.5 |
| 73 | Claude 3 Opus Anthropic · Proprietary | 48.1 |
| 74 | Qwen3 235B 2507 (Reasoning) Alibaba · Open Weight | 47.4 |
| 75 | GPT-4.1 nano OpenAI · Proprietary | 47.4 |
| 76 | Nemotron Ultra 253B NVIDIA · Open Weight | 46.7 |
| 77 | Gemini 2.5 Flash Google · Proprietary | 46.5 |
| 78 | GPT-OSS 120B OpenAI · Open Weight | 44.8 |
| 79 | GPT-4 Turbo OpenAI · Proprietary | 44.7 |
| 80 | DeepSeek-R1 DeepSeek · Open Weight | 44.5 |
| 81 | DeepSeek V3.1 (Reasoning) DeepSeek · Open Weight | 44.3 |
| 82 | Claude 3 Haiku Anthropic · Proprietary | 44 |
| 83 | GPT-5.4 nano OpenAI · Proprietary | 42.9 |
| 84 | Z-1 Z · Proprietary | 42.2 |
| 85 | Moonshot v1 Moonshot AI · Proprietary | 42.2 |
| 86 | Nemotron-4 15B NVIDIA · Open Weight | 41.3 |
| 87 | Llama 3 70B Meta · Open Weight | 41.2 |
| 88 | Mistral 8x7B Mistral · Open Weight | 41.1 |
| 89 | Llama 4 Maverick Meta · Open Weight | 40.9 |
| 90 | Gemini 1.0 Pro Google · Proprietary | 39.8 |
| 91 | o1-pro OpenAI · Proprietary | 39.7 |
| 92 | Nemotron 3 Nano 30B NVIDIA · Open Weight | 39.6 |
| 93 | Llama 4 Scout Meta · Open Weight | 39 |
| 94 | Phi-4 Microsoft · Open Weight | 38.3 |
| 95 | Grok 3 [Beta] xAI · Proprietary | 35.5 |
| 96 | Sarvam 30B Sarvam · Open Weight | 35.5 |
| 97 | GPT-OSS 20B OpenAI · Open Weight | 35.4 |
| 98 | Nova Pro Amazon · Proprietary | 34.9 |
| 99 | Gemma 3 27B Google · Open Weight | 34.4 |
| 100 | DBRX Instruct Databricks · Open Weight | 34.3 |
| 101 | Qwen3 235B 2507 Alibaba · Open Weight | 33.7 |
| 102 | Llama 4 Behemoth Meta · Open Weight | 33 |
| 103 | DeepSeek V3.1 DeepSeek · Open Weight | 32.9 |
| 104 | Mixtral 8x22B Instruct v0.1 Mistral · Open Weight | 31.8 |
| 105 | GLM-4.5-Air Zhipu AI · Proprietary | 31.5 |
| 106 | GLM-4.5 Zhipu AI · Proprietary | 28 |
| 107 | Mistral 8x7B v0.2 Mistral · Open Weight | 27.8 |
| 108 | Mistral 7B v0.3 Mistral · Open Weight | 26.4 |
Agentic score = weighted average of Terminal-Bench 2.0 (40%), OSWorld-Verified (35%), and BrowseComp (25%), normalized by available weights. Display-only benchmarks (MCP Atlas, Toolathlon, etc.) are tracked but do not affect rankings.
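For concreteness, here is a minimal Python sketch of that weighted average, assuming per-benchmark scores on a 0–100 scale. The function name and the example inputs are illustrative, not BenchLM's actual implementation.

```python
# Sketch of the agentic score: a weighted average of the three ranking
# benchmarks, renormalized when a model is missing a score. The weights
# come from the note above; the sample data below is made up.
RANKING_WEIGHTS = {
    "Terminal-Bench 2.0": 0.40,
    "OSWorld-Verified": 0.35,
    "BrowseComp": 0.25,
}

def agentic_score(benchmark_scores: dict[str, float]) -> float | None:
    """Weighted average over whichever ranking benchmarks have scores."""
    available = {k: w for k, w in RANKING_WEIGHTS.items() if k in benchmark_scores}
    if not available:
        return None  # no ranking benchmarks reported for this model
    total_weight = sum(available.values())
    weighted = sum(benchmark_scores[k] * w for k, w in available.items())
    return weighted / total_weight  # normalize by the weights actually present

# Hypothetical model with no BrowseComp result:
print(agentic_score({"Terminal-Bench 2.0": 90.0, "OSWorld-Verified": 80.0}))
# (90*0.40 + 80*0.35) / 0.75 = 85.33...
```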
Agent benchmarks test whether AI models can go beyond answering questions and actually complete multi-step tasks: browsing the web, writing and running code in a terminal, calling external APIs via function calling, and operating desktop or mobile interfaces. They measure real-world usefulness for autonomous workflows.
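As a rough illustration of what these harnesses exercise, the sketch below shows a generic observe-act loop: the model proposes an action, the harness executes it, and the result is fed back until the task is done. The `call_model` and `execute` helpers are hypothetical placeholders, not any benchmark's real API.

```python
# Minimal agent loop: the model emits either a tool action or a final
# answer; the harness runs the action and appends the observation.
def run_agent(task: str, call_model, execute, max_steps: int = 20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)           # dict describing the next step
        if action["type"] == "final_answer":
            return action["content"]           # task finished
        observation = execute(action)          # run the tool / shell / browser step
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    return None  # step budget exhausted without a final answer
```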
Function calling (or tool use) lets an LLM invoke external tools, APIs, or databases as part of its response. This is critical for building AI agents that can search the web, query databases, send emails, or control other software. Benchmarks like BFCL v4 and Toolathlon specifically measure how reliably models select the right function and pass correct arguments.
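A minimal sketch of what this looks like in practice, using the common JSON-Schema style of tool definition that function-calling benchmarks score models against. The `get_weather` tool and the dispatch helper are made-up examples, not part of any benchmark.

```python
import json

# A tool definition in the widely used OpenAI-style "tools" format:
# the model must pick the right function and supply valid arguments.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local implementation.
    Benchmarks like BFCL check that the name and arguments are valid."""
    args = json.loads(tool_call["arguments"])  # models return arguments as a JSON string
    if tool_call["name"] == "get_weather":
        return f"22°C and sunny in {args['city']}"  # stubbed result
    raise ValueError(f"unknown tool: {tool_call['name']}")
```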
MCP (the Model Context Protocol) is an open standard for connecting LLMs to external tools and data sources. MCP Atlas and MCP-Tasks benchmark how well models work with MCP-backed integrations. Strong MCP performance means a model integrates well into tool-rich agent architectures.
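For a sense of the wire format, the sketch below shows the JSON-RPC 2.0 requests an MCP client sends to discover and invoke tools. The `tools/list` and `tools/call` method names follow the MCP specification; the `search_docs` tool and its arguments are hypothetical.

```python
# JSON-RPC 2.0 messages an MCP client exchanges with a server.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",          # ask the server which tools it exposes
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                  # hypothetical tool exposed by the server
        "arguments": {"query": "rate limits"},  # arguments matching that tool's schema
    },
}
```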
Agentic carries 22% of BenchLM's overall score because the ability to use tools, browse, and complete multi-step tasks is the strongest differentiator between models in production use. A model that scores well on knowledge but cannot reliably call functions or navigate software has limited real-world utility for agent workflows.
Currently, GPT-5.4 Pro by OpenAI leads the weighted agentic rankings with a score of 87.7. The best open-weight agent model is GLM-5 (Reasoning) (78.3). Check the leaderboard above for the full ranking.