Agentic Benchmarks — Terminal, Browsing & Computer Use Leaderboard
Tool use, browser research, and computer-use workflows
Bottom line: Claude Mythos Preview holds a perfect agentic score, but GPT-5.5 is close behind at 99.5%, and GPT-5.4 delivers most of that capability at significantly lower cost for production workloads.
Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified · CyberGym · BrowseComp-VL · OSWorld · AndroidWorld · WebVoyager · MCP Atlas · Toolathlon · ZClawBench · Tau2-Telecom · DeepSearchQA · Tau2-Airline · PinchBench · BFCL v4 · MLE-Bench Lite · MM-ClawBench
Best Agentic picks
BenchLM summaries for agentic plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
- Claude Mythos Preview (Anthropic) · 100 category score
- DeepSeek V4 Pro (Max) (DeepSeek) · 87 overall score
- Qwen3.6-27B (Alibaba) · $0.00 avg / 1M tokens
- Mercury 2 (Inception) · 789 tokens / sec
- LFM2-24B-A2B (LiquidAI) · 0.42s TTFT
- Nemotron 3 Ultra 500B (NVIDIA) · 10M context window
Top AI Models for Agentic — April 2026
As of April 2026, Claude Mythos Preview leads the provisional agentic leaderboard with a weighted score of 100.0%, followed by GPT-5.5 (99.5%) and Gemini 3 Pro Deep Think (95.7%). BenchLM is currently showing 99 provisional-ranked models and 9 verified-ranked models in this category.
1. Claude Mythos Preview (Anthropic): Perfect agentic score. Dominates terminal and browser tasks. Higher cost and latency.
2. GPT-5.5 (OpenAI)
3. Gemini 3 Pro Deep Think (Google)
What changed
- Claude Mythos Preview debuted at #1 with a 100.0 weighted agentic score — the first model to achieve this.
- GPT-5.5 holds #2 at 99.5%, strong on both Terminal-Bench and BrowseComp.
- Gemini 3 Pro Deep Think rounds out the top three at 95.7%, while Claude Opus 4.6 (85.1%, #10) posts the most consistent scores across the agentic sub-benchmarks.
How to choose
- Building autonomous agents? Claude Mythos Preview — best agentic model available.
- Cost-sensitive production agents? GPT-5.4 — 87.9% at lower cost.
- Need reliability over peak performance? Claude Opus 4.6 — most balanced across sub-tasks.
- Open-weight agentic model? GLM-5 (Reasoning) — best open-weight option for agents.
Agentic AI Leaderboard
Updated April 24, 2026. Sorted by agentic weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Org | Agentic | Overall | Terminal-Bench 2.0 | BrowseComp | OSWorld-Verified | CyberGym | BrowseComp-VL | OSWorld | AndroidWorld | WebVoyager | MCP Atlas | Toolathlon | ZClawBench | Tau2-Telecom | DeepSearchQA | Tau2-Airline | PinchBench | BFCL v4 | MLE-Bench Lite | MM-ClawBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 100% | 99 | 82% | 86.9% | 79.6% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 2 | GPT-5.5 | OpenAI | 99.5% | 93 | 82.7% | 84.4% | 78.7% | 81.8% | — | — | — | — | 75.3% | 55.6% | — | 98% | — | — | — | — | — | — |
| 3 | Gemini 3 Pro Deep Think | Google | 95.7% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | 94.6% | 90 | 69.4% | 79.3% | 78% | 73.1% | — | — | — | — | 77.3% | — | — | — | — | — | — | — | — | — |
| 5 | GPT-5.4 Pro | OpenAI | 91.7% | 91 | — | 89.3% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 6 | o1-preview | OpenAI | 90.6% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 7 | — | — | 88.3% | 85 | 66.7% | 83.2% | 73.1% | — | — | — | — | — | 55.9% | 50% | — | — | 92.5% | — | — | — | — | — |
| 8 | GPT-5.4 | OpenAI | 87.9% | 89 | 75.1% | 82.7% | 75% | 79.0% | — | — | — | — | 70.6% | 54.6% | — | 92.8% | 73.6% | — | — | — | — | — |
| 9 | Gemini 3.1 Pro | Google | 87.2% | 92 | — | — | — | — | — | — | — | — | — | — | — | 95.6% | 69.7% | — | — | — | — | — |
| 10 | Claude Opus 4.6 | Anthropic | 85.1% | 87 | 65.4% | 83.7% | 74% | — | — | — | — | — | — | — | — | — | 73.7% | — | — | — | — | — |
| 11 | — | — | 82.8% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 12 | Claude Sonnet 4.6 | Anthropic | 81.9% | 83 | 59.1% | — | 72.5% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 13 | GPT-5.3 Codex | OpenAI | 80.7% | Est. 88 | 77.3% | — | 64.7% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 14 | Grok 4.1 | xAI | 79.3% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 15 | GPT-5 (high) | OpenAI | 78.8% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 16 | GPT-5.1 | OpenAI | 77.3% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 17 | Claude Opus 4.5 | Anthropic | 76.9% | 77 | 59.3% | — | 66.3% | — | — | 66.3% | — | — | 42.3% | 43.5% | — | — | — | — | — | — | — | — |
| 18 | Gemini 3 Pro | Google | 73.2% | 81 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 19 | — | — | 72.7% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 20 | GPT-5 (medium) | OpenAI | 71.3% | Est. 72 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 21 | Qwen3.6 Plus | Alibaba | 71.2% | 74 | 61.6% | — | — | — | — | — | — | — | 48.2% | 39.8% | — | — | — | — | — | — | — | — |
| 22 | — | — | 65.6% | 77 | 50.8% | 60.6% | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 23 | — | — | 65.2% | Est. 63 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 24 | o3-mini | OpenAI | 64.7% | Est. 56 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 25 | — | — | 64.1% | Est. 70 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
Score in Context
What these scores mean
The weighted score blends Terminal-Bench 2.0 (28%), BrowseComp (18%), and OSWorld-Verified (24%). In practice, a 5-point gap is the difference between an agent that reliably completes multi-step tasks and one that stalls midway.
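One plausible reading of that blend, assuming the three weights are simply renormalized to sum to one (an assumption here; the published scores appear to be rescaled further, since the leader lands at exactly 100):

```latex
\text{Agentic} = \frac{0.28\, s_{\mathrm{TB}} + 0.18\, s_{\mathrm{BC}} + 0.24\, s_{\mathrm{OSW}}}{0.28 + 0.18 + 0.24}
```

where s_TB, s_BC, and s_OSW are a model's Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified scores, and any missing term drops out of both numerator and denominator.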
Known limitations
Agentic benchmarks are newer and less standardized than coding or knowledge tests. Terminal-Bench and BrowseComp use different evaluation harnesses, so cross-benchmark comparison requires care. Some models lack agentic benchmark data entirely and are excluded from rankings rather than estimated.
How we weight
Agentic capability carries a 22% weight in BenchLM.ai's overall scoring — the single biggest contributor, reflecting that browse-and-do workflows now matter more than raw chat fluency.
Agentic benchmarks test whether an AI model can do work, not just talk about it — opening tools, gathering evidence, navigating software, and staying coherent over long action chains. See the agentic leaderboard or compare with coding benchmarks.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
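That filter-then-fallback step is straightforward to sketch. Below is a minimal illustration in Python, assuming hypothetical row fields (`score`, `source`) and the three weighted benchmarks above; BenchLM's actual pipeline and normalization may differ.

```python
# Hypothetical sketch of the category blend and fallback described above.
# The field names, "source" labels, and renormalization step are assumptions
# for illustration, not BenchLM's actual implementation.

WEIGHTS = {
    "Terminal-Bench 2.0": 0.28,
    "BrowseComp": 0.18,
    "OSWorld-Verified": 0.24,
}

def agentic_score(rows: dict[str, dict]) -> float | None:
    """Blend the weighted benchmarks, dropping synthetic rows and
    renormalizing over whatever trustworthy scores remain."""
    # Exclude rows generated from other scores or cloned from reference models.
    trusted = {
        name: row
        for name, row in rows.items()
        if row.get("source") not in {"generated", "cloned"}
    }
    total = weight_sum = 0.0
    for name, weight in WEIGHTS.items():
        row = trusted.get(name)
        if row is None:
            continue  # missing benchmark: fall back to the remaining rows
        total += weight * row["score"]
        weight_sum += weight
    if weight_sum == 0.0:
        return None  # no trustworthy weighted rows: excluded, not estimated
    return total / weight_sum  # renormalize so partial coverage stays on 0-100

# Made-up example: the OSWorld-Verified row was cloned from a reference model,
# so it is filtered out and the score falls back to the two remaining rows.
print(agentic_score({
    "Terminal-Bench 2.0": {"score": 82.0, "source": "public"},
    "BrowseComp": {"score": 86.9, "source": "public"},
    "OSWorld-Verified": {"score": 79.6, "source": "cloned"},
}))  # (0.28 * 82.0 + 0.18 * 86.9) / 0.46 ≈ 83.9
```

Renormalizing by the surviving weight keeps partial coverage on the same 0-100 scale instead of penalizing a model for benchmarks it was never scored on.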
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| Terminal-Bench 2.0 | 28% | Weighted | Agentic software engineering and terminal task completion benchmark |
| BrowseComp | 18% | Weighted | Web research benchmark for browsing agents |
| OSWorld-Verified | 24% | Weighted | Computer-use benchmark for GUI task completion |
| CyberGym | — | Display only | Cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance. |
| BrowseComp-VL | — | Display only | Vision-language browsing benchmark for multimodal web research and tool-use tasks. |
| OSWorld | — | Display only | Computer-use benchmark for GUI task completion across the broader OSWorld task suite. |
| AndroidWorld | — | Display only | Android GUI agent benchmark for task completion across mobile app workflows. |
| WebVoyager | — | Display only | Browser agent benchmark for completing multi-step workflows on live websites. |
| MCP Atlas | — | Display only | Tool-calling benchmark for Model Context Protocol integrations and multi-tool coordination |
| Toolathlon | — | Display only | General tool-calling benchmark for multi-step API and tool usage |
| ZClawBench | — | Display only | Z.AI's OpenClaw workflow benchmark for broad agent tasks across research, office work, data analysis, devops, automation, and security. |
| Tau2-Telecom | — | Display only | Telecom-focused tool-use benchmark for structured API workflows |
| DeepSearchQA | — | Display only | Agentic browsing benchmark for list-style question answering with browser tools. |
| Tau2-Airline | — | Display only | Airline-domain tool-use benchmark for structured workflow execution and API correctness. |
| PinchBench | — | Display only | An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows. |
| BFCL v4 | — | Display only | Function-calling benchmark for tool selection, schema adherence, and argument correctness. |
| MLE-Bench Lite | — | Display only | A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings. |
| MM-ClawBench | — | Display only | An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance. |
About Agentic Benchmarks
Agentic benchmarks test whether a model can do work, not just talk about it: opening tools, gathering evidence, navigating software, and staying coherent over long action chains.