Claw-Eval

Name: Claw-Eval
Creator: BenchLM

A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

How BenchLM shows Claw-Eval

BenchLM mirrors the official Claw-Eval 2026-05-09 leaderboard snapshot. The source benchmark contains 300 human-verified tasks and 2,159 rubric items, and its primary metric is Pass^3, computed over 3 independent trials.

The public Claw-Eval site separates the 199-task general plus multi-turn agent table from the 101-task native multimodal table. BenchLM sorts this page by the primary general plus multi-turn Pass^3 table and preserves native multimodal split scores in the mirrored snapshot metadata.

Claw-Eval is display only on BenchLM. It is strong evidence about agent reliability, but the public rows are benchmark-harness results rather than normalized model-only rankings, so they are excluded from BenchLM overall and category scores.

23 primary agent rows · 11 native multimodal rows · 300 tasks · 2,159 rubrics · Display only

Claw-Eval leaderboard · GitHub repository · Hugging Face dataset

Pass^3 on Claw-Eval — 2026-05-09 snapshot

BenchLM mirrors the published Pass^3 view for Claw-Eval. Claude Opus 4.6 leads the public snapshot at 70.4%, followed by Claude Sonnet 4.6 (67.8%) and MiMo-V2.5-Pro (63.8%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.6 (opus46) · Anthropic · Closed · 70.4% · Overall 87 · Context 1M

2. Claude Sonnet 4.6 (sonnet46) · Anthropic · Closed · 67.8% · Overall 83 · Context 200K

3. MiMo-V2.5-Pro (mimo_v25_pro) · Xiaomi · Closed · 63.8% · Overall n/a · Context 1M

23 models · Agentic · Current · Display only · Updated 2026-05-09 snapshot

The published Claw-Eval snapshot is tightly clustered at the top: Claude Opus 4.6 sits at 70.4%, while the third row is only 6.6 points behind. The broader top-10 spread is 11.6 points, so the benchmark still separates strong models even when the leaders cluster.
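The two gaps quoted above can be checked directly from the scores listed on this page. The snippet below is a minimal sketch: the list holds the ten highest Pass^3 values from the 2026-05-09 table, and the helper name `gap` is a hypothetical one chosen for illustration, not part of any BenchLM or Claw-Eval tooling.

```python
# Top-10 Pass^3 scores (percent), copied from the 2026-05-09 snapshot table on this page.
top10 = [70.4, 67.8, 63.8, 63.8, 62.3, 62.3, 62.3, 60.3, 59.8, 58.8]


def gap(scores, rank_a, rank_b):
    """Point gap between two 1-based ranks, rounded to one decimal place."""
    return round(scores[rank_a - 1] - scores[rank_b - 1], 1)


top3_gap = gap(top10, 1, 3)       # leader vs. third row
top10_spread = gap(top10, 1, 10)  # leader vs. tenth row

print(top3_gap, top10_spread)  # prints: 6.6 11.6
```

Rounding to one decimal place avoids floating-point noise in the subtraction and matches the precision of the published scores.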

23 models have been evaluated on Claw-Eval. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Claw-Eval is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Claw-Eval

Year

2026

Tasks

300 tasks, 2,159 rubrics

Format

End-to-end autonomous-agent evaluation with Pass^3 scoring

Difficulty

Real-world general, multi-turn, and native multimodal agent execution

Claw-Eval v1.1.0 evaluates autonomous agents on full-trajectory tasks audited for completion, safety, and robustness. Its primary Pass^3 metric requires a task to pass in all three independent trials, reducing lucky-run effects. BenchLM mirrors the official leaderboard as display-only because rows reflect benchmark harness execution as well as model capability.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents · Public benchmark source
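To illustrate why requiring a pass in all three trials is stricter than averaging, here is a minimal sketch of a Pass^3-style calculation. It assumes each task has exactly three boolean trial outcomes; the function names, task IDs, and toy results are hypothetical and are not taken from the Claw-Eval harness.

```python
from typing import Mapping, Sequence


def pass_all_k(trials_by_task: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Fraction of tasks that passed in every one of their k trials."""
    if not trials_by_task:
        raise ValueError("at least one task is required")
    passed = 0
    for task_id, trials in trials_by_task.items():
        if len(trials) != k:
            raise ValueError(f"{task_id}: expected {k} trials, got {len(trials)}")
        if all(trials):
            passed += 1
    return passed / len(trials_by_task)


def mean_trial_pass_rate(trials_by_task: Mapping[str, Sequence[bool]]) -> float:
    """Plain per-trial pass rate, shown only for comparison."""
    outcomes = [t for trials in trials_by_task.values() for t in trials]
    return sum(outcomes) / len(outcomes)


# Hypothetical toy results: four tasks, three trials each.
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],   # one failed run, so it does not count
    "task_c": [True, True, True],
    "task_d": [False, False, False],
}

print(f"Pass^3: {pass_all_k(results):.1%}")               # prints: Pass^3: 50.0%
print(f"Mean trial: {mean_trial_pass_rate(results):.1%}")  # prints: Mean trial: 66.7%
```

In this toy data, `task_b` succeeds in two of three runs and still contributes nothing to the all-trials score, which is the lucky-run effect the metric is meant to suppress.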

BenchLM freshness & provenance

Version

Claw-Eval 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Pass^3 table (23 models)

Model                  ID                 Creator      Open/Closed  Pass^3
Claude Opus 4.6        opus46             Anthropic    Closed       70.4%
Claude Sonnet 4.6      sonnet46           Anthropic    Closed       67.8%
MiMo-V2.5-Pro          mimo_v25_pro       Xiaomi       Closed       63.8%
Muse Spark             muse_spark         Meta         Closed       63.8%
Kimi K2.6              kimi_k26           Moonshot AI  Open         62.3%
MiMo-V2.5              mimo_v25           Xiaomi       Closed       62.3%
GLM-5.1                glm51              Z.AI         Open         62.3%
GPT-5.4                gpt54              OpenAI       Closed       60.3%
DeepSeek V4 Pro        deepseek_v4_pro    DeepSeek     Open         59.8%
Qwen3.6 Plus           qwen3.6_plus       Alibaba      Closed       58.8%
Gemini 3.1 Pro         gemini31_pro       Google       Closed       57.8%
DeepSeek V4 Flash      deepseek_v4_flash  DeepSeek     Open         57.8%
MiMo-V2-Pro            mimo_v2_pro        Xiaomi       Closed       57.8%
Qwen3.5 397B           qwen3.5-397b-a17b  Alibaba      Open         56.8%
GLM-5-Turbo            glm5_turbo         Z.AI         Closed       55.8%
GLM-5V-Turbo           glm5v_turbo        Z.AI         Closed       53.8%
Kimi K2.5              kimi_k25           Moonshot AI  Open         52.3%
Agnes-2.0-flash        agnes_20_flash     SapiensAI    -            51.8%
Gemini 3 Flash         gemini3_flash      Google       Closed       49.2%
MiniMax M2.7           minimax_m27        MiniMax      Open         48.7%
MiMo-V2-Omni           mimo_v2_omni       Xiaomi       Closed       45.2%
DeepSeek V3.2          deepseek_v32       DeepSeek     Open         40.2%
Nemotron 3 Super 100B  nemotron3_super    NVIDIA       Open         5.5%

FAQ

What does Claw-Eval measure?

Claw-Eval is a transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

Which model leads the published Claw-Eval snapshot?

Claude Opus 4.6 currently leads the published Claw-Eval snapshot with a Pass^3 score of 70.4%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Claw-Eval?

23 AI models are included in BenchLM's mirrored Claw-Eval snapshot, which is based on the public leaderboard as captured on 2026-05-09.

Last updated: 2026-05-09 snapshot · mirrored from the public benchmark leaderboard
