
Agentic Benchmarks — Terminal, Browsing & Computer Use Leaderboard

Tool use, browser research, and computer-use workflows

Bottom line: Claude Mythos Preview has a perfect agentic score, but GPT-5.5 is close behind at 99.5%, and GPT-5.4 is significantly cheaper for production workloads.

Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified · CyberGym · BrowseComp-VL · OSWorld · AndroidWorld · WebVoyager · MCP Atlas · Toolathlon · ZClawBench · Tau2-Telecom · DeepSearchQA · Tau2-Airline · PinchBench · BFCL v4 · MLE-Bench Lite · MM-ClawBench

Terminal/tool use · Browser research · Computer use

Best Agentic picks

BenchLM summaries for agentic plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Agentic · April 2026

As of April 2026, Claude Mythos Preview leads the provisional agentic leaderboard with a weighted score of 100.0%, followed by GPT-5.5 (99.5%) and Gemini 3 Pro Deep Think (95.7%). BenchLM is currently showing 99 provisional-ranked models and 9 verified-ranked models in this category.

What changed

Claude Mythos Preview debuted at #1 with a 100.0% weighted agentic score — the first model to achieve this.

GPT-5.4 sits at #8 with 87.9%, strong on both Terminal-Bench and BrowseComp.

Claude Opus 4.6 remains near the top at 92.6%, with the most consistent scores across all agentic sub-benchmarks.

How to choose

Top models by benchmark

Terminal-Bench 2.0: Agentic software engineering and terminal task completion benchmark (28% of category score)

Agentic AI Leaderboard

Updated April 24, 2026

Sorted by agentic weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

99 ranked models · showing 25
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.

Rank | Model | Organization | Weighted agentic score
1 | Claude Mythos Preview | Anthropic | 100.0%
2 | GPT-5.5 | OpenAI | 99.5%
3 | Gemini 3 Pro Deep Think | Google | 95.7%
8 | GPT-5.4 | OpenAI | 87.9%
11 | GLM-5 (Reasoning) | Z.AI, self-host | 82.8%
16 | GPT-5.1 | OpenAI | 77.3%
19 | Qwen3.5 397B (Reasoning) | Alibaba, self-host | 72.7%
24 | o3-mini | OpenAI | 64.7%

These rankings update weekly


Score in Context

What these scores mean

Agentic carries the highest weight at 22% in BenchLM.ai's overall scoring — reflecting that browse-and-do workflows now matter more than raw chat fluency. The weighted score blends Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified. A 5-point gap can mean the difference between an agent that reliably completes multi-step tasks and one that stalls midway.
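
A minimal sketch of how such a blend could be computed, using the weights from the benchmark table below (Terminal-Bench 2.0 at 28%, BrowseComp at 18%, OSWorld-Verified at 24%). The function name, the renormalization over available benchmarks, and the example scores are illustrative assumptions, not BenchLM's published implementation.

```python
# Hypothetical sketch of a BenchLM-style weighted agentic score.
# Weights come from the benchmark table on this page; renormalizing
# over whichever weighted benchmarks a model has is an assumption.
AGENTIC_WEIGHTS = {
    "Terminal-Bench 2.0": 0.28,
    "BrowseComp": 0.18,
    "OSWorld-Verified": 0.24,
}

def agentic_score(results: dict[str, float]) -> float:
    """results maps benchmark name -> score in percent (0-100)."""
    present = {name: w for name, w in AGENTIC_WEIGHTS.items() if name in results}
    total_weight = sum(present.values())
    return sum(results[name] * w for name, w in present.items()) / total_weight

# Hypothetical model with scores on all three weighted benchmarks.
example = {"Terminal-Bench 2.0": 85.0, "BrowseComp": 78.0, "OSWorld-Verified": 80.0}
print(round(agentic_score(example), 1))  # -> 81.5
```

The resulting category score is what then feeds the 22% agentic share of BenchLM's overall score.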

Known limitations

Agentic benchmarks are newer and less standardized than coding or knowledge tests. Terminal-Bench and BrowseComp use different evaluation harnesses, so cross-benchmark comparison requires care. Some models lack agentic benchmark data entirely and are excluded from rankings rather than estimated.

How we weight

Agentic capability carries a 22% weight in BenchLM.ai's overall scoring — the single biggest contributor, reflecting that browse-and-do workflows now matter more than raw chat fluency.

Agentic benchmarks test whether an AI model can do work, not just talk about it — opening tools, gathering evidence, navigating software, and staying coherent over long action chains. See the agentic leaderboard or compare with coding benchmarks.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
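
A minimal sketch of that filter-and-fallback rule, assuming each benchmark row carries provenance flags; the field names (generated, cloned) and the renormalization step are illustrative guesses, not BenchLM's actual schema or code.

```python
# Hypothetical sketch of the row filter and weight fallback described above.
# The provenance fields below are assumptions, not BenchLM's real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkRow:
    benchmark: str
    score: float             # percent, 0-100
    generated: bool = False  # derived from other scores
    cloned: bool = False     # copied from a reference model

def category_score(rows: list[BenchmarkRow],
                   weights: dict[str, float]) -> Optional[float]:
    # Drop rows that were generated from other scores or cloned.
    trusted = [r for r in rows if not (r.generated or r.cloned)]
    # Keep only the weighted benchmarks that survive the filter.
    usable = [r for r in trusted if r.benchmark in weights]
    if not usable:
        return None  # excluded from the ranking rather than estimated
    # Fallback: renormalize the weights over the remaining trusted rows.
    total = sum(weights[r.benchmark] for r in usable)
    return sum(r.score * weights[r.benchmark] for r in usable) / total
```

Returning None when no trusted weighted rows remain mirrors the note above: models lacking agentic benchmark data are excluded from the rankings rather than estimated.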

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
Terminal-Bench 2.0 | 28% | Weighted | Agentic software engineering and terminal task completion benchmark
BrowseComp | 18% | Weighted | Web research benchmark for browsing agents
OSWorld-Verified | 24% | Weighted | Computer-use benchmark for GUI task completion
CyberGym | n/a | Display only | Cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance
BrowseComp-VL | n/a | Display only | Vision-language browsing benchmark for multimodal web research and tool-use tasks
OSWorld | n/a | Display only | Computer-use benchmark for GUI task completion across the broader OSWorld task suite
AndroidWorld | n/a | Display only | Android GUI agent benchmark for task completion across mobile app workflows
WebVoyager | n/a | Display only | Browser agent benchmark for completing multi-step workflows on live websites
MCP Atlas | n/a | Display only | Tool-calling benchmark for Model Context Protocol integrations and multi-tool coordination
Toolathlon | n/a | Display only | General tool-calling benchmark for multi-step API and tool usage
ZClawBench | n/a | Display only | Z.AI's OpenClaw workflow benchmark for broad agent tasks across research, office work, data analysis, devops, automation, and security
Tau2-Telecom | n/a | Display only | Telecom-focused tool-use benchmark for structured API workflows
DeepSearchQA | n/a | Display only | Agentic browsing benchmark for list-style question answering with browser tools
Tau2-Airline | n/a | Display only | Airline-domain tool-use benchmark for structured workflow execution and API correctness
PinchBench | n/a | Display only | An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows
BFCL v4 | n/a | Display only | Function-calling benchmark for tool selection, schema adherence, and argument correctness
MLE-Bench Lite | n/a | Display only | A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings
MM-ClawBench | n/a | Display only | An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance

