A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.
As of April 10, 2026, Claude Mythos Preview leads the SWE-bench Pro leaderboard with 77.8%, followed by GLM-5.1 (58.4%) and GPT-5.4 (57.7%).
1. Claude Mythos Preview (Anthropic): 77.8%
2. GLM-5.1 (Z.AI): 58.4%
3. GPT-5.4 (OpenAI): 57.7%
According to BenchLM.ai, the nearly 20-point gap between first and second place reflects significant spread across the leaderboard, which makes this benchmark effective at differentiating model capabilities.
14 models have been evaluated on SWE-bench Pro. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Pro contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
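To make that weighting concrete, here is a minimal sketch of how a benchmark's share of the overall score composes. The function and field names are hypothetical, not BenchLM's actual schema; only the 20% and 23% weights and the leaderboard scores come from this page.

```python
# Hypothetical sketch of BenchLM-style weighted aggregation.
# Names and structure are assumptions, not BenchLM's actual code.

CATEGORY_WEIGHT = 0.20    # Coding category's share of the overall score
BENCHMARK_WEIGHT = 0.23   # SWE-bench Pro's share within the Coding category

def effective_overall_weight(category_weight: float, benchmark_weight: float) -> float:
    """Share of a model's overall score attributable to one benchmark."""
    return category_weight * benchmark_weight

def overall_contribution(score: float) -> float:
    """Points a benchmark score adds to an overall 0-100 ranking."""
    return score * effective_overall_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT)

if __name__ == "__main__":
    # SWE-bench Pro alone controls 20% x 23% = 4.6% of the overall score.
    print(f"Effective weight: "
          f"{effective_overall_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT):.1%}")
    for model, score in [("Claude Mythos Preview", 77.8),
                         ("GLM-5.1", 58.4),
                         ("GPT-5.4", 57.7)]:
        print(f"{model}: {overall_contribution(score):.2f} overall points")
```

Under these assumptions, the 20-point spread on the benchmark itself translates into roughly a 0.9-point difference in overall score between the leader and GPT-5.4.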
Year: 2026
Tasks: Real-world software engineering
Format: Repository task completion
Difficulty: Frontier coding agent
SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026: it reflects more realistic task difficulty than the older SWE-bench Verified subset.
Version: SWE-bench Pro 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
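The three tier names above come from this page, but the thresholds and logic in the sketch below are purely illustrative assumptions; BenchLM's actual rules live on its methodology page.

```python
# Illustrative sketch only; thresholds are assumptions, not BenchLM's policy.
from enum import Enum

class BenchmarkTier(Enum):
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"

def classify_benchmark(quarters_since_refresh: int, questions_public: bool) -> BenchmarkTier:
    """Map freshness metadata to a display tier using made-up cutoffs.

    Public question sets can leak into training data, so this sketch
    assumes they go stale faster (our assumption, not BenchLM's rule).
    """
    stale_limit = 2 if questions_public else 4
    if quarters_since_refresh == 0:
        return BenchmarkTier.STRONG_DIFFERENTIATOR
    if quarters_since_refresh <= stale_limit:
        return BenchmarkTier.WATCH
    return BenchmarkTier.DISPLAY_ONLY

# SWE-bench Pro 2026: refreshed quarterly and currently marked "Current",
# so quarters_since_refresh == 0 keeps it a strong differentiator here.
print(classify_benchmark(0, True))  # BenchmarkTier.STRONG_DIFFERENTIATOR
```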