Claw-Eval

A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

How BenchLM shows Claw-Eval

BenchLM mirrors the official Claw-Eval 2026-05-09 leaderboard snapshot. The source benchmark contains 300 human-verified tasks, 2,159 rubric items, and uses Pass^3 as the primary metric across 3 independent trials.

The public Claw-Eval site separates the 199-task general plus multi-turn agent table from the 101-task native multimodal table. BenchLM sorts this page by the primary general plus multi-turn Pass^3 table and preserves native multimodal split scores in the mirrored snapshot metadata.

Claw-Eval is display only on BenchLM. It is strong evidence about agent reliability, but the public rows are benchmark-harness results rather than normalized model-only rankings, so they are excluded from BenchLM overall and category scores.

23 primary agent rows · 11 native multimodal rows · 300 tasks · 2,159 rubrics · Display only

Pass^3 on Claw-Eval — 2026-05-09 snapshot

BenchLM mirrors the published Pass^3 view for Claw-Eval. Claude Opus 4.6 leads the public snapshot at 70.4%, followed by Claude Sonnet 4.6 (67.8%) and MiMo-V2.5-Pro (63.8%). BenchLM does not use these results to rank models overall.

23 models · Agentic · Current · Display only · Updated 2026-05-09 snapshot

The published Claw-Eval snapshot is tightly clustered at the top: Claude Opus 4.6 sits at 70.4%, while the third row is only 6.6 points behind. The broader top-10 spread is 11.6 points, so the benchmark still separates strong models even when the leaders cluster.
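The two gaps quoted above can be checked directly against the top ten Pass^3 values in the table on this page. The snippet below is only an arithmetic check; the list and the helper name `gap` are illustrative and not part of BenchLM or Claw-Eval.

```python
# Top-10 Pass^3 percentages, copied from the table on this page (ranks 1-10).
top10 = [70.4, 67.8, 63.8, 63.8, 62.3, 62.3, 62.3, 60.3, 59.8, 58.8]


def gap(scores, i, j):
    """Difference in percentage points between rows i and j (0-indexed)."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(scores[i] - scores[j], 1)


top3_gap = gap(top10, 0, 2)      # leader vs. third row
top10_spread = gap(top10, 0, 9)  # leader vs. tenth row

print(top3_gap, top10_spread)  # 6.6 11.6
```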

23 models have been evaluated on Claw-Eval. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Claw-Eval is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Claw-Eval

Year: 2026

Tasks: 300 tasks, 2,159 rubrics

Format: End-to-end autonomous-agent evaluation with Pass^3 scoring

Difficulty: Real-world general, multi-turn, and native multimodal agent execution

Claw-Eval v1.1.0 evaluates autonomous agents on full-trajectory tasks audited for completion, safety, and robustness. Its primary Pass^3 metric requires a task to pass in all three independent trials, reducing lucky-run effects. BenchLM mirrors the official leaderboard as display-only because rows reflect benchmark harness execution as well as model capability.
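A minimal sketch of the all-trials-must-pass rule described above, assuming the score is the share of tasks whose three trials all pass. How Claw-Eval decides that a single trial passes (for example, from its rubric items) is not described on this page, so the sketch takes per-trial booleans as given. The function name `pass_cubed` and the example task IDs are hypothetical.

```python
from typing import Mapping, Sequence


def pass_cubed(trials_by_task: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Fraction of tasks that pass in every one of k independent trials.

    Assumption: each task contributes equally and per-trial pass/fail
    outcomes are already known; this is not the official harness.
    """
    if not trials_by_task:
        raise ValueError("no tasks supplied")
    passed = 0
    for task_id, trials in trials_by_task.items():
        if len(trials) != k:
            raise ValueError(
                f"task {task_id!r} has {len(trials)} trials, expected {k}"
            )
        # A single failed trial is enough to fail the whole task.
        if all(trials):
            passed += 1
    return passed / len(trials_by_task)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task_a": [True, True, True],     # counts
    "task_b": [True, False, True],    # one lucky-run miss: does not count
    "task_c": [True, True, True],     # counts
    "task_d": [False, False, False],  # does not count
}

score = pass_cubed(example)
print(f"{score:.1%}")  # 50.0%
```

Note that `task_b` passes two of three trials yet contributes nothing, which is how a strict all-trials metric reduces lucky-run effects compared with averaging per-trial pass rates.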

BenchLM freshness & provenance

Version: Claw-Eval 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Pass^3 table (23 models)

Rank  Model                  Identifier          Pass^3
1     Claude Opus 4.6        -                   70.4%
2     Claude Sonnet 4.6      -                   67.8%
3     MiMo-V2.5-Pro          mimo_v25_pro        63.8%
4     Muse Spark             muse_spark          63.8%
5     Kimi K2.6              kimi_k26            62.3%
6     MiMo-V2.5              mimo_v25            62.3%
7     GLM-5.1                glm51               62.3%
8     GPT-5.4                gpt54               60.3%
9     DeepSeek V4 Pro        deepseek_v4_pro     59.8%
10    Qwen3.6 Plus           qwen3.6_plus        58.8%
11    Gemini 3.1 Pro         gemini31_pro        57.8%
12    DeepSeek V4 Flash      deepseek_v4_flash   57.8%
13    MiMo-V2-Pro            mimo_v2_pro         57.8%
14    Qwen3.5 397B           qwen3.5-397b-a17b   56.8%
15    GLM-5-Turbo            glm5_turbo          55.8%
16    GLM-5V-Turbo           glm5v_turbo         53.8%
17    Kimi K2.5              kimi_k25            52.3%
18    Agnes-2.0-flash        agnes_20_flash      51.8%
19    Gemini 3 Flash         gemini3_flash       49.2%
20    MiniMax M2.7           minimax_m27         48.7%
21    MiMo-V2-Omni           mimo_v2_omni        45.2%
22    DeepSeek V3.2          deepseek_v32        40.2%
23    Nemotron 3 Super 100B  nemotron3_super     5.5%

FAQ

What does Claw-Eval measure?

Claw-Eval measures how reliably autonomous agents complete real-world tasks. It is a transparent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

Which model leads the published Claw-Eval snapshot?

Claude Opus 4.6 currently leads the published Claw-Eval snapshot with a Pass^3 score of 70.4%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Claw-Eval?

23 AI models are included in BenchLM's mirrored Claw-Eval snapshot, which is based on the public leaderboard as captured on 2026-05-09.

Last updated: 2026-05-09 snapshot · mirrored from the public benchmark leaderboard
