SWE-bench Pro

A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

Top models on SWE-bench Pro — June 13, 2026

As of June 13, 2026, Claude Mythos 5 leads the SWE-bench Pro leaderboard with 80.3%, followed by Claude Fable 5 (80%) and Claude Opus 4.8 (69.2%).

38 models · Coding · 23% of category score · Current · Updated June 13, 2026

According to BenchLM.ai, Claude Mythos 5 leads the SWE-bench Pro benchmark with a score of 80.3%, followed by Claude Fable 5 (80%) and Claude Opus 4.8 (69.2%). Scores across the 38 evaluated models range from 46.3% to 80.3%, a spread wide enough for the benchmark to differentiate model capabilities.

38 models have been evaluated on SWE-bench Pro. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Pro contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
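To make the weighting concrete, here is a minimal Python sketch of the arithmetic. It assumes the category weight and the within-category share combine by simple multiplication, which this page does not state explicitly; the function and constant names are hypothetical and are not part of any BenchLM API.

```python
# Illustrative sketch only: assumes BenchLM combines the two weights by
# simple multiplication. The actual scoring formula may differ.

CODING_CATEGORY_WEIGHT = 0.20  # Coding category's share of the overall score (from the text)
SWE_BENCH_PRO_SHARE = 0.23     # SWE-bench Pro's share of the Coding category (from the text)


def effective_overall_weight(category_weight: float, benchmark_share: float) -> float:
    """Return the fraction of the overall score attributable to one benchmark."""
    return category_weight * benchmark_share


def overall_contribution(score_pct: float) -> float:
    """Points (out of 100) a SWE-bench Pro score would add under this assumption."""
    return score_pct * effective_overall_weight(CODING_CATEGORY_WEIGHT, SWE_BENCH_PRO_SHARE)


if __name__ == "__main__":
    weight = effective_overall_weight(CODING_CATEGORY_WEIGHT, SWE_BENCH_PRO_SHARE)
    print(f"Effective overall weight: {weight:.1%}")
    print(f"Contribution of an 80.3% score: {overall_contribution(80.3):.2f} points")
```

Under this assumption, SWE-bench Pro accounts for roughly 4.6% of a model's overall score (0.20 × 0.23).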

About SWE-bench Pro

Year: 2026

Tasks: Real-world software engineering

Format: Repository task completion

Difficulty: Frontier coding agent

SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026. Its tasks reflect more realistic difficulty than the older SWE-bench Verified subset.

BenchLM freshness & provenance

Version: SWE-bench Pro 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set (Current)

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (38 models)

Rank  Score
1     80.3%
2     80%
3     69.2%
4     64.3%
5     60.6%
6     59%
7     58.6%
8     58.6%
9     58.4%
10    57.7%
11    57.6%
12    57.3%
13    57.2%
14    57.1%
15    56.8%
16    56.6%
17    56.3%
18    56.2%
19    56.1%
20    55.6%
21    55.4%
22    55.1%
23    55.1%
24    54.4%
25    53.5%
26    53.4%
27    52.8%
28    52.6%
29    52.4%
30    52.3%
31    52.1%
32    51.8%
33    50.9%
34    50.7%
35    49.5%
36    49.2%
37    49.1%
38    46.3%
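
The spread of the 38 scores listed above can be summarised with a short Python sketch. The score list is copied from the leaderboard; the `summarize` helper and its output keys are hypothetical names chosen for illustration, not anything BenchLM publishes.

```python
from statistics import median

# Scores (in percent) copied from the leaderboard above, rank 1 to rank 38.
SCORES = [
    80.3, 80.0, 69.2, 64.3, 60.6, 59.0, 58.6, 58.6, 58.4, 57.7,
    57.6, 57.3, 57.2, 57.1, 56.8, 56.6, 56.3, 56.2, 56.1, 55.6,
    55.4, 55.1, 55.1, 54.4, 53.5, 53.4, 52.8, 52.6, 52.4, 52.3,
    52.1, 51.8, 50.9, 50.7, 49.5, 49.2, 49.1, 46.3,
]


def summarize(scores):
    """Return simple spread statistics for a list of percentage scores."""
    ordered = sorted(scores, reverse=True)
    # Gap between each rank and the next one down, rounded to one decimal
    # to avoid floating-point noise such as 10.799999999999997.
    gaps = [round(a - b, 1) for a, b in zip(ordered, ordered[1:])]
    largest_gap = max(gaps)
    return {
        "count": len(ordered),
        "range": round(ordered[0] - ordered[-1], 1),
        "median": round(median(ordered), 2),
        "largest_gap": largest_gap,
        # The largest gap sits between this rank and the next.
        "largest_gap_after_rank": gaps.index(largest_gap) + 1,
    }


if __name__ == "__main__":
    for key, value in summarize(SCORES).items():
        print(f"{key}: {value}")
```

On these figures the range is 34.0 points and the median is 55.85%. The largest single gap, 10.8 points, falls between ranks 2 and 3.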

FAQ

What does SWE-bench Pro measure?

A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

Which model scores highest on SWE-bench Pro?

Claude Mythos 5 by Anthropic currently leads with a score of 80.3% on SWE-bench Pro.

How many models are evaluated on SWE-bench Pro?

38 AI models have been evaluated on SWE-bench Pro on BenchLM.

Last updated: June 13, 2026 · BenchLM version SWE-bench Pro 2026
