Software Engineering Benchmark Verified (SWE-bench Verified)

A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.

Top models on SWE-bench Verified — June 2, 2026

As of June 2, 2026, Claude Mythos Preview leads the SWE-bench Verified leaderboard with 93.9%, followed by Claude Opus 4.8 (88.6%) and Claude Opus 4.7 (Adaptive) (87.6%).

49 models · Coding · 13% of category score · Refreshing · Updated June 2, 2026

According to BenchLM.ai, Claude Mythos Preview leads the SWE-bench Verified benchmark with a score of 93.9%, followed by Claude Opus 4.8 (88.6%) and Claude Opus 4.7 (Adaptive) (87.6%). Scores range from 93.9% down to 23.6%: the top four models score 85% or higher, ranks 5 through 42 cluster between 69.2% and 80.9%, and the remaining seven fall below 64%.

49 models have been evaluated on SWE-bench Verified. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Verified contributes 13% of the category score, so strong performance here directly affects a model's overall ranking.
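
To make the weighting concrete, here is a minimal sketch of how the two percentages above might combine. It assumes the category weight and the within-category share simply multiply, and that raw benchmark scores feed in without normalization. Both are assumptions for illustration, not something this page states, and the function names are hypothetical. The methodology page is the authority on the actual formula.

```python
# Figures quoted on this page.
CODING_CATEGORY_WEIGHT = 0.20    # Coding category's weight in the overall score
SWE_BENCH_VERIFIED_SHARE = 0.13  # SWE-bench Verified's share of the Coding category


def effective_weight(category_weight: float, share_in_category: float) -> float:
    """Weight of one benchmark in the overall score.

    ASSUMPTION: weights compose multiplicatively (hypothetical, not confirmed here).
    """
    return category_weight * share_in_category


def overall_contribution(score_pct: float, weight: float) -> float:
    """Points a benchmark score would add to an overall score on a 0-100 scale.

    ASSUMPTION: the raw percentage is used as-is, with no normalization.
    """
    return score_pct * weight


if __name__ == "__main__":
    w = effective_weight(CODING_CATEGORY_WEIGHT, SWE_BENCH_VERIFIED_SHARE)
    print(f"Effective weight: {w:.3f}")
    print(f"Contribution of a 93.9% score: {overall_contribution(93.9, w):.2f} points")
```

Under these assumptions, SWE-bench Verified would account for about 2.6% of the overall score, so a 10-point gap on this benchmark would move the overall score by roughly a quarter of a point.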

About SWE-bench Verified

Year: 2024
Tasks: 500 verified issues
Format: Code patch generation
Difficulty: Professional software engineering

SWE-bench Verified is widely treated as the gold standard for evaluating AI coding agents on real-world software engineering tasks. Each task requires the model to understand an existing codebase, write a patch that resolves the reported issue, and pass the associated test suite.
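
The pass/fail logic behind a score can be sketched roughly as follows. In the upstream SWE-bench dataset, each task lists FAIL_TO_PASS tests (which the patch must fix) and PASS_TO_PASS tests (which must keep passing). That detail comes from the SWE-bench project rather than from this page. The function names and test names below are hypothetical, and the sketch leaves out the real harness steps of applying the patch and running tests in an isolated environment.

```python
from typing import Dict, Iterable, List


def is_resolved(
    test_status: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True if every required test passed after applying the patch.

    test_status maps a test name to True (passed) or False (failed).
    A required test missing from test_status counts as a failure.
    """
    required: List[str] = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name, False) for name in required)


def resolved_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of tasks resolved, the kind of figure a leaderboard reports."""
    results = list(outcomes)
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    status = {"test_issue_fixed": True, "test_existing_behavior": True}
    print(is_resolved(status, ["test_issue_fixed"], ["test_existing_behavior"]))
    print(resolved_rate([True, False, True, True]))
```

The design point is that a patch has to fix the issue without breaking existing behavior: a task counts as resolved only when both test groups pass.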

BenchLM freshness & provenance

Version: SWE-bench Verified 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (49 models)

1. 93.9% (Claude Mythos Preview)
2. 88.6% (Claude Opus 4.8)
3. 87.6% (Claude Opus 4.7 (Adaptive))
4. 85%
5. 80.9%
6. 80.8%
7. 80.6%
8. 80.5%
9. 80.4%
10. 80.2%
11. 80%
12. 79.6%
13. 79.4%
14. 79%
15. 78.8%
16. 78.6%
17. 78%
18. 77.8%
19. 77.6%
20. 77.4%
21. 77.2%
22. 77.2%
23. 76.8%
24. 76.8%
25. 76.7%
26. 76.2%
27. 74.8%
28. 74.6%
29. 74.5%
30. 74.4%
31. 73.8%
32. 73.7%
33. 73.6%
34. 73.4%
35. 73.4%
36. 73.3%
37. 72.7%
38. 72.4%
39. 72%
40. 70.8%
41. 69.9%
42. 69.2%
43. 63.8%
44. 54.6%
45. 53.2%
46. 49.3%
47. 49%
48. 42%
49. 23.6%
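
The spread across the 49 scores listed above can be summarized with a few lines of Python. This is a rough sketch: the scores are transcribed from this leaderboard in rank order, while the helper names and the 69% to 81% band are illustrative choices, not BenchLM definitions.

```python
from statistics import median

# Scores (percent) transcribed from the leaderboard on this page, in rank order.
SCORES = [
    93.9, 88.6, 87.6, 85.0, 80.9, 80.8, 80.6, 80.5, 80.4, 80.2,
    80.0, 79.6, 79.4, 79.0, 78.8, 78.6, 78.0, 77.8, 77.6, 77.4,
    77.2, 77.2, 76.8, 76.8, 76.7, 76.2, 74.8, 74.6, 74.5, 74.4,
    73.8, 73.7, 73.6, 73.4, 73.4, 73.3, 72.7, 72.4, 72.0, 70.8,
    69.9, 69.2, 63.8, 54.6, 53.2, 49.3, 49.0, 42.0, 23.6,
]


def summarize(scores):
    """Return basic spread statistics for a list of percentage scores."""
    return {
        "count": len(scores),
        "max": max(scores),
        "min": min(scores),
        "median": median(scores),
        # Rounded to one decimal to hide floating-point noise.
        "range": round(max(scores) - min(scores), 1),
    }


def count_in_band(scores, low, high):
    """Count the scores that fall within [low, high], inclusive."""
    return sum(1 for s in scores if low <= s <= high)


if __name__ == "__main__":
    print(summarize(SCORES))
    # Illustrative band: how many models sit in the crowded middle of the table.
    print(count_in_band(SCORES, 69.0, 81.0))
```

Most of the field sits in a narrow band in the middle of the table, so small score differences there shift rank more than they do at either extreme.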

FAQ

What does SWE-bench Verified measure?

SWE-bench Verified measures how well models resolve real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn. It is a curated, human-verified subset of SWE-bench.

Which model scores highest on SWE-bench Verified?

Claude Mythos Preview by Anthropic currently leads with a score of 93.9% on SWE-bench Verified.

How many models are evaluated on SWE-bench Verified?

49 AI models have been evaluated on SWE-bench Verified on BenchLM.

Last updated: June 2, 2026 · BenchLM version SWE-bench Verified 2024
