A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.
As of April 21, 2026, Claude Mythos Preview leads the SWE-bench Verified leaderboard with 93.9%, followed by Claude Opus 4.7 (87.6%) and GPT-5.3 Codex (85%).
Claude Mythos Preview (Anthropic)
Claude Opus 4.7 (Anthropic)
GPT-5.3 Codex (OpenAI)
According to BenchLM.ai, Claude Mythos Preview leads the SWE-bench Verified benchmark with a score of 93.9%, followed by Claude Opus 4.7 (87.6%) and GPT-5.3 Codex (85%). The scores show a moderate spread: 8.9 percentage points separate first place from third, and the differences between top-tier and mid-tier models are meaningful.
34 models have been evaluated on SWE-bench Verified. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Verified contributes 13% of the category score, so strong performance here directly affects a model's overall ranking.
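To make the weighting arithmetic above concrete, here is a minimal sketch. It assumes the two weights combine multiplicatively (category weight times the benchmark's share within the category); the page does not spell out the aggregation formula, so that assumption and the function names `effective_weight` and `overall_points` are illustrative only.

```python
# Weights quoted in the text above.
CODING_CATEGORY_WEIGHT = 0.20  # Coding category's weight in the overall score
SWE_BENCH_SHARE = 0.13         # SWE-bench Verified's share of the Coding category


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to a single benchmark.

    Assumption: the weights multiply. This is not confirmed by the source.
    """
    return category_weight * benchmark_share


def overall_points(score_pct: float, weight: float) -> float:
    """Points (out of 100) that a benchmark score adds under that assumption."""
    return score_pct * weight


weight = effective_weight(CODING_CATEGORY_WEIGHT, SWE_BENCH_SHARE)
print(f"Effective weight of SWE-bench Verified: {weight:.1%}")  # 2.6%

# Scores quoted in the text above.
for model, score in [
    ("Claude Mythos Preview", 93.9),
    ("Claude Opus 4.7", 87.6),
    ("GPT-5.3 Codex", 85.0),
]:
    print(f"{model}: {overall_points(score, weight):.2f} points")
```

Under this assumption the benchmark accounts for about 2.6% of the overall score, so even the gap between first and third place moves the overall score by only a fraction of a point.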
Year: 2024
Tasks: 500 verified issues
Format: Code patch generation
Difficulty: Professional software engineering
SWE-bench Verified is the gold standard for evaluating AI coding agents on real-world software engineering tasks. Each task requires understanding codebases, writing patches, and passing test suites.
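The scoring idea behind "passing test suites" can be sketched as follows. This is a simplified illustration, not the official evaluation harness: it assumes a task counts as resolved only when every associated test passes after the model's patch is applied, and that the headline percentage is the share of resolved tasks. The `TaskResult` type, its fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str              # hypothetical identifier for one GitHub issue
    tests_passed: list[bool]  # outcome of each test after applying the patch


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if it has tests and all of them pass."""
    return bool(result.tests_passed) and all(result.tests_passed)


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty result set."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up sample data for illustration only.
results = [
    TaskResult("task-a", [True, True]),   # resolved
    TaskResult("task-b", [True, False]),  # one failing test: not resolved
    TaskResult("task-c", [True]),         # resolved
    TaskResult("task-d", []),             # no test outcomes: not resolved
]
print(f"Resolved: {resolved_rate(results):.1f}%")  # 2 of 4 tasks
```

The all-or-nothing check is the key design choice in this sketch: a patch that fixes the issue but breaks another test earns no credit.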
Version: SWE-bench Verified 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Mythos Preview by Anthropic currently leads with a score of 93.9% on SWE-bench Verified.
34 AI models have been evaluated on SWE-bench Verified on BenchLM.