A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.
According to BenchLM.ai, GPT-5.3 Codex leads the SWE-bench Pro benchmark with a score of 90, followed by GPT-5.4 Pro (89) and GPT-5.2 Pro (89). The top models are clustered within one point, suggesting this benchmark is nearing saturation for frontier models.
121 models have been evaluated on SWE-bench Pro. The benchmark falls in the coding category, which carries a 17% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
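To make the 17% category weight concrete, here is a minimal sketch of how a weighted category score could feed into an overall ranking. BenchLM.ai's actual formula is not published here; the category names, scores, and weights below are illustrative assumptions.

```python
# Hypothetical sketch of a weighted overall score.
# Only the 17% coding weight comes from the text; everything else is assumed.

def overall_score(category_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-category benchmark scores."""
    # Weights are fractions of the overall score and must sum to 1.
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(category_scores[c] * weights[c] for c in weights)

scores = {"coding": 90.0, "reasoning": 85.0, "other": 80.0}
weights = {"coding": 0.17, "reasoning": 0.40, "other": 0.43}

print(round(overall_score(scores, weights), 2))  # prints 83.7
```

Under this sketch, a one-point gain on a coding benchmark moves the overall score by only 0.17 points, which is why clustering at the top of SWE-bench Pro leaves little room for models to differentiate.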
Year: 2026
Tasks: Real-world software engineering
Format: Repository task completion
Difficulty: Frontier coding agent
SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026. It reflects more realistic difficulty than the older verified subset.
Why we no longer evaluate SWE-bench Verified