A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.
According to BenchLM.ai, GPT-5.3 Codex leads the SWE-bench Pro benchmark with a score of 90, followed by GPT-5.4 Pro (89) and GPT-5.2 Pro (89). The top models are clustered within one point, suggesting this benchmark is nearing saturation for frontier models.
121 models have been evaluated on SWE-bench Pro. The benchmark falls in the coding category, which carries a 17% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
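To make the 17% category weight concrete, here is a minimal sketch of how a weighted category score could feed into an overall ranking. BenchLM.ai's actual formula is not published here; the category names, scores, and weights below are illustrative assumptions.

```python
# Hypothetical sketch of a weighted overall score.
# Only the 17% coding weight comes from the text; everything else is assumed.

def overall_score(category_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-category benchmark scores."""
    # Weights are fractions of the overall score and must sum to 1.
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(category_scores[c] * weights[c] for c in weights)

scores = {"coding": 90.0, "reasoning": 85.0, "other": 80.0}
weights = {"coding": 0.17, "reasoning": 0.40, "other": 0.43}

print(round(overall_score(scores, weights), 2))  # prints 83.7
```

Under this sketch, a one-point gain on a coding benchmark moves the overall score by only 0.17 points, which is why clustering at the top of SWE-bench Pro leaves little room for models to differentiate.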
Year: 2026
Tasks: Real-world software engineering
Format: Repository task completion
Difficulty: Frontier coding agent
SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026. It reflects more realistic difficulty than the older verified subset.
Why we no longer evaluate SWE-bench Verified