Agentic Benchmarks

Tool use, browser research, and computer-use workflows

Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Agentic benchmarks test whether an AI model can do work, not just talk about it. That means opening tools, gathering evidence, navigating software, and staying coherent over a long chain of actions.

BenchLM.ai tracks three agentic benchmarks with different strengths: Terminal-Bench 2.0 focuses on coding and terminal workflows, BrowseComp measures web research ability, and OSWorld-Verified probes computer-use reliability.

Agentic capability now carries a 22% weight in BenchLM.ai's overall scoring. It remains the single largest contributor to the overall ranking, reflecting the view that browse-and-do workflows now matter more than raw chat fluency alone.
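To make the 22% figure concrete, here is a minimal sketch of how a category weight feeds into a weighted overall score. Only the 22% agentic weight comes from the text above. The `weighted_overall` helper, the single `everything_else` bucket, and the example scores are hypothetical and chosen purely for illustration. BenchLM.ai's actual formula and its other category weights are not stated here.

```python
import math

# The only weight stated in the text: 22% for agentic capability.
AGENTIC_WEIGHT = 0.22


def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted sum of category scores (each on a 0-100 scale)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical split: every non-agentic category is lumped into one bucket.
weights = {"agentic": AGENTIC_WEIGHT, "everything_else": 1.0 - AGENTIC_WEIGHT}

# Hypothetical model: 91 on agentic, 80 across everything else.
scores = {"agentic": 91.0, "everything_else": 80.0}

overall = weighted_overall(scores, weights)
agentic_points = scores["agentic"] * weights["agentic"]
print(f"overall={overall:.2f}, agentic contributes {agentic_points:.2f} points")
```

Under these assumed numbers, the agentic score alone accounts for roughly 20 of the overall points, which shows why a strong or weak agentic result can move a model noticeably in a ranking weighted this way.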

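Top 25 of 123 models

| Rank | Model | Provider | Open/Closed | Type | Context | Score | Terminal-Bench 2.0 | BrowseComp | OSWorld-Verified |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 90% | 88% | 84% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 88% | 88% | 82% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 90% | 88% | 85% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 90% | 88% | 86% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 90% | 84% | 81% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 86% | 82% | 80% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 90% | 82% | 83% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 80% | 85% | 74% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 83% | 82% | 74% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 90% | 85% | 85% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 77% | 86% | 68% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 90% | 85% | 82% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 79% | 79% | 73% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 77% | 87% | 73% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 78% | 79% | 71% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 78% | 75% | 72% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 70% | 77% | 68% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 81% | 80% | 74% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 77% | 78% | 72% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 71% | 73% | 68% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 68% | 83% | 66% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 77% | 79% | 71% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 69% | 74% | 69% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 74% | 73% | 66% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 75% | 77% | 68% |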

About Agentic Benchmarks

Agentic software engineering and terminal task completion benchmark