Agentic Benchmarks
Tool use, browser research, and computer-use workflows
Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified
Agentic benchmarks test whether an AI model can do work, not just talk about it. That means opening tools, gathering evidence, navigating software, and staying coherent over a long chain of actions.
BenchLM.ai tracks three agentic benchmarks with different strengths: Terminal-Bench 2.0 focuses on coding and terminal workflows, BrowseComp measures web research ability, and OSWorld-Verified probes computer-use reliability.
Agentic capability now carries a 22% weight in BenchLM.ai's overall scoring. It is still the single biggest contributor to the overall ranking, reflecting the view that browse-and-do workflows now matter more than raw chat fluency alone.
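To make the weighting concrete, here is a minimal sketch of how a 22% agentic weight could enter a weighted overall score. Only the 22% figure comes from the text above; the other category names, their weights, and the example scores are hypothetical placeholders, not BenchLM.ai's actual formula.

```python
# Hypothetical category weights. Only the 0.22 for "agentic" comes from the
# text; the remaining categories and their weights are illustrative.
WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.20,
    "knowledge": 0.20,
    "math": 0.18,
}


def overall_score(category_scores: dict[str, float]) -> float:
    """Return the weighted average of per-category scores (0-100 scale)."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return weighted / total_weight


def agentic_contribution(category_scores: dict[str, float]) -> float:
    """Return the points the agentic category adds to the overall score."""
    total_weight = sum(WEIGHTS.values())
    return WEIGHTS["agentic"] * category_scores["agentic"] / total_weight


if __name__ == "__main__":
    # Made-up scores for a single model, one per category.
    example = {
        "agentic": 90.0,
        "coding": 80.0,
        "reasoning": 80.0,
        "knowledge": 80.0,
        "math": 80.0,
    }
    print(round(overall_score(example), 2))
    print(round(agentic_contribution(example), 2))
```

Under these assumed weights, a 10-point lead in the agentic category moves the overall score by 2.2 points, more than the same lead in any other single category.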
Rank | Model | Creator | Source | Type | Context | Score | Terminal-Bench 2.0 | BrowseComp | OSWorld-Verified
1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 90% | 88% | 84%
2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 88% | 88% | 82%
3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 90% | 88% | 85%
4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 90% | 88% | 86%
5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 90% | 84% | 81%
6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 86% | 82% | 80%
7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 90% | 82% | 83%
8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 80% | 85% | 74%
9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 83% | 82% | 74%
10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 90% | 85% | 85%
11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 77% | 86% | 68%
12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 90% | 85% | 82%
13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 79% | 79% | 73%
14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 77% | 87% | 73%
15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 78% | 79% | 71%
16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 78% | 75% | 72%
17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 70% | 77% | 68%
18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 81% | 80% | 74%
19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 77% | 78% | 72%
20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 71% | 73% | 68%
21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 68% | 83% | 66%
22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 77% | 79% | 71%
23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 69% | 74% | 69%
24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 74% | 73% | 66%
25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 75% | 77% | 68%
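The ranking above follows the score column, so a model can lead on an individual benchmark without leading the list. The sketch below re-sorts four rows copied from the table by a single benchmark. It assumes the three percentage columns map to Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified in that order; the field names and helper functions are illustrative, not part of any BenchLM.ai API.

```python
from statistics import mean
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: int
    # Assumed column order: Terminal-Bench 2.0, BrowseComp, OSWorld-Verified.
    terminal_bench: int
    browsecomp: int
    osworld: int


# Four rows copied from the table above (percent signs dropped).
ROWS = [
    Row("GPT-5.4 Pro", 91, 90, 88, 84),
    Row("Claude Opus 4.6", 85, 80, 85, 74),
    Row("GPT-5.2-Codex", 85, 90, 85, 85),
    Row("Gemini 3.1 Pro", 84, 77, 86, 68),
]


def benchmark_mean(row: Row) -> float:
    """Return the unweighted mean of the three benchmark percentages."""
    return mean([row.terminal_bench, row.browsecomp, row.osworld])


def rank_by(field: str) -> list[Row]:
    """Return ROWS sorted from highest to lowest on the given field."""
    return sorted(ROWS, key=lambda row: getattr(row, field), reverse=True)


if __name__ == "__main__":
    for field in ("score", "terminal_bench", "browsecomp", "osworld"):
        order = [row.model for row in rank_by(field)]
        print(f"{field}: {order}")
    for row in ROWS:
        print(f"{row.model}: mean of three benchmarks = {benchmark_mean(row):.2f}")
```

In this sample, sorting by the assumed OSWorld-Verified column puts GPT-5.2-Codex first, while sorting by the assumed BrowseComp column moves Gemini 3.1 Pro up to second place.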
About Agentic Benchmarks
Terminal-Bench 2.0: an agentic software engineering and terminal task completion benchmark.