Agentic capability is the single biggest factor in BenchLM.ai's overall ranking, carrying a 22% weight. It measures what matters most for production AI systems: whether a model can complete multi-step workflows, not just answer questions. Terminal-Bench 2.0 covers coding and shell tasks, BrowseComp measures web research and evidence gathering, and OSWorld-Verified tests computer-use reliability across real software interfaces. Models that lead here can browse, plan, use tools, and recover from mistakes without hand-holding, making this the most predictive category for real-world AI agent performance.
According to BenchLM.ai, GPT-5.3 Codex leads this ranking with a score of 88.1, followed by GPT-5.4 (87.8) and GPT-5.4 Pro (87.4). The top three are separated by less than a point; any of them would perform well for this use case.
The best open-weight option is GLM-5 (Reasoning), ranked #12 with a score of 78.3. Proprietary models hold a clear advantage in this category, though open-weight options may suffice for less demanding use cases.
This ranking is based on average scores across all agentic benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
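As a rough sketch of the methodology described above, the Python below computes a category score as the unweighted mean of per-benchmark results, then applies the 22% category weight to get the agentic contribution to an overall score. The per-benchmark numbers and the 0-100 scale are illustrative assumptions; only the benchmark names and the 22% weight come from this page.

```python
# Sketch of the scoring pipeline under stated assumptions; the per-benchmark
# scores below are placeholders, not BenchLM.ai's actual data.

AGENTIC_WEIGHT = 0.22  # agentic category weight in the overall ranking (from the page)

# Hypothetical per-benchmark scores for one model (0-100 scale assumed).
benchmark_scores = {
    "Terminal-Bench 2.0": 84.0,  # coding and shell tasks
    "BrowseComp": 90.0,          # web research and evidence gathering
    "OSWorld-Verified": 87.0,    # computer-use reliability
}

# Category score: unweighted mean across all tracked agentic benchmarks.
agentic_score = sum(benchmark_scores.values()) / len(benchmark_scores)

# Contribution of the agentic category to the overall ranking, assuming the
# overall score is a weighted sum of category scores.
agentic_contribution = AGENTIC_WEIGHT * agentic_score

print(f"agentic score: {agentic_score:.1f}")            # -> 87.0
print(f"weighted contribution: {agentic_contribution:.2f}")  # -> 19.14
```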