A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.
As of June 13, 2026, Claude Mythos 5 leads the SWE-bench Pro leaderboard with 80.3%, followed by Claude Fable 5 (80%) and Claude Opus 4.8 (69.2%).
Claude Mythos 5 (Anthropic)
Claude Fable 5 (Anthropic)
Claude Opus 4.8 (Anthropic)
According to BenchLM.ai, scores are widely spread across the leaderboard: the top three alone span 11.1 percentage points (80.3% to 69.2%), which makes the benchmark effective at differentiating model capabilities.
38 models have been evaluated on SWE-bench Pro. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Pro contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
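The arithmetic behind those two weights can be sketched as follows. This is a minimal illustration, assuming the category weight and the within-category share simply multiply, which the text implies but does not state; the BenchLM methodology page may apply normalization or other adjustments. The function names are hypothetical and not part of any BenchLM API.

```python
# Weights quoted in the text; combining them by multiplication is an assumption.
CATEGORY_WEIGHT = 0.20   # Coding category share of the overall score
BENCHMARK_SHARE = 0.23   # SWE-bench Pro share of the Coding category


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to one benchmark."""
    return category_weight * benchmark_share


def overall_points(score_pct: float,
                   category_weight: float = CATEGORY_WEIGHT,
                   benchmark_share: float = BENCHMARK_SHARE) -> float:
    """Points a benchmark score adds to a 0-100 overall score (hypothetical helper)."""
    return score_pct * effective_weight(category_weight, benchmark_share)


if __name__ == "__main__":
    weight = effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE)
    print(f"Effective weight of SWE-bench Pro: {weight:.1%}")

    # Scores taken from the leaderboard text above.
    gap = overall_points(80.3) - overall_points(69.2)
    print(f"Overall-score gap between first and third place: {gap:.2f} points")
```

Under this assumption SWE-bench Pro accounts for about 4.6% of the overall score, so the 11.1-point gap between first and third place translates to roughly half a point overall.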
Year: 2026
Tasks: Real-world software engineering
Format: Repository task completion
Difficulty: Frontier coding agent
SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026. Its tasks reflect more realistic difficulty than the older SWE-bench Verified subset.
Version: SWE-bench Pro 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.