A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.
BenchLM mirrors the official Claw-Eval 2026-05-09 leaderboard snapshot. The source benchmark contains 300 human-verified tasks and 2,159 rubric items, and uses Pass^3 across three independent trials as its primary metric.
The public Claw-Eval site separates the 199-task general plus multi-turn agent table from the 101-task native multimodal table. BenchLM sorts this page by the primary general plus multi-turn Pass^3 table and preserves native multimodal split scores in the mirrored snapshot metadata.
Claw-Eval is display-only on BenchLM. It is strong evidence about agent reliability, but the public rows are benchmark-harness results rather than normalized model-only rankings, so they are excluded from BenchLM overall and category scores.
BenchLM mirrors the published Pass^3 view for Claw-Eval. Claude Opus 4.6 leads the public snapshot at 70.4%, followed by Claude Sonnet 4.6 (67.8%) and MiMo-V2.5-Pro (63.8%). BenchLM does not use these results to rank models overall.
Claude Opus 4.6 (Anthropic, opus46)
Claude Sonnet 4.6 (Anthropic, sonnet46)
MiMo-V2.5-Pro (Xiaomi, mimo_v25_pro)
The published Claw-Eval snapshot is tightly clustered at the top: Claude Opus 4.6 sits at 70.4%, while the third row is only 6.6 points behind. The broader top-10 spread is 11.6 points, so the benchmark still separates strong models even when the leaders cluster.
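The gap figures follow directly from the three scores quoted on this page. A minimal Python check, using only those published figures (the variable names are illustrative, not part of Claw-Eval or BenchLM):

```python
# Pass^3 scores (percent) for the top three rows, as quoted on this page.
top_three = {
    "Claude Opus 4.6": 70.4,
    "Claude Sonnet 4.6": 67.8,
    "MiMo-V2.5-Pro": 63.8,
}

# Identify the leader by score.
leader_name, leader_score = max(top_three.items(), key=lambda item: item[1])

# Gap to the leader in percentage points, rounded to one decimal
# place to suppress floating-point noise.
gaps = {name: round(leader_score - score, 1) for name, score in top_three.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points behind {leader_name}")
```

The 11.6-point top-10 spread cannot be reproduced the same way here, because the tenth-place score is not quoted on this page.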
23 models have been evaluated on Claw-Eval. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Claw-Eval is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 300 tasks, 2,159 rubrics
Format: End-to-end autonomous-agent evaluation with Pass^3 scoring
Difficulty: Real-world general, multi-turn, and native multimodal agent execution
Claw-Eval v1.1.0 evaluates autonomous agents on full-trajectory tasks audited for completion, safety, and robustness. Its primary Pass^3 metric requires a task to pass in all three independent trials, reducing lucky-run effects. BenchLM mirrors the official leaderboard as display-only because rows reflect benchmark harness execution as well as model capability.
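To illustrate why an all-trials metric is stricter than an average pass rate, here is a minimal Python sketch. It assumes each task has exactly three boolean pass/fail outcomes; the function names, task IDs, and data are hypothetical and are not taken from the Claw-Eval harness, which also scores rubric items that this sketch ignores.

```python
from typing import Dict, List


def pass_all_trials(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks that passed in every one of k independent trials."""
    if not results:
        raise ValueError("results must contain at least one task")
    for task_id, outcomes in results.items():
        if len(outcomes) != k:
            raise ValueError(f"{task_id}: expected {k} trials, got {len(outcomes)}")
    # A task counts only if all k of its trials passed.
    solved = sum(1 for outcomes in results.values() if all(outcomes))
    return solved / len(results)


def mean_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Plain average over all individual trials, shown for comparison."""
    trials = [outcome for outcomes in results.values() for outcome in outcomes]
    return sum(trials) / len(trials)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task_a": [True, True, True],     # counts toward Pass^3
    "task_b": [True, False, True],    # one failed trial, so it does not count
    "task_c": [True, True, True],     # counts toward Pass^3
    "task_d": [False, False, False],  # never passes
}

strict = pass_all_trials(example)
average = mean_pass_rate(example)
print(f"Pass^3: {strict:.1%}, mean pass rate: {average:.1%}")
```

In this made-up example, task_b passes two of three trials and still contributes nothing to the strict score, which is the lucky-run effect the metric is meant to filter out.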
Version: Claw-Eval 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.6 currently leads the published Claw-Eval snapshot with a Pass^3 score of 70.4%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
23 AI models are included in BenchLM's mirrored Claw-Eval snapshot, which is based on the public leaderboard as captured on 2026-05-09.