Software Engineering Benchmark Verified (SWE-bench Verified)

A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.

About SWE-bench Verified

Year: 2024
Tasks: 500 verified issues
Format: Code patch generation
Difficulty: Professional software engineering

SWE-bench Verified is a widely used benchmark for evaluating AI coding agents on real-world software engineering tasks. Each task requires the model to understand an existing codebase, write a patch that resolves the issue, and pass the repository's test suite.
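The resolution criterion can be sketched in a few lines. This is a simplified, hypothetical version of the harness logic, assuming the SWE-bench convention of two per-task test lists: FAIL_TO_PASS (tests that must pass once the issue is fixed) and PASS_TO_PASS (tests that must not regress); function and variable names here are illustrative, not the benchmark's actual API.

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Decide whether a model's patch resolves a task.

    results maps each test id to whether it passed after the patch
    was applied. A task counts as resolved only if every FAIL_TO_PASS
    test now passes and every PASS_TO_PASS test still passes; a test
    missing from results is treated as a failure.
    """
    return all(results.get(t, False) for t in fail_to_pass + pass_to_pass)


# Example: the patch fixes the failing test without breaking existing behavior.
results = {"test_issue_fix": True, "test_existing_behavior": True}
print(is_resolved(results, ["test_issue_fix"], ["test_existing_behavior"]))  # True
```

A model's leaderboard score is then the percentage of the 500 tasks for which this check succeeds.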

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Leaderboard (88 models)

#1 GPT-5.3 Codex: 85
#2 GPT-5.4: 81
#3 GPT-5.2: 80
#4 Claude Opus 4.6: 80
#5 Grok 4.1: 77
#6 GPT-5.2-Codex: 76
#7 Gemini 3.1 Pro: 75
#9 Claude Sonnet 4.6: 69
#10 Claude Opus 4.5: 68
#11 GPT-5.1: 68
#13 GPT-5 (high): 67
#14 GPT-5 (medium): 67
#15 Claude Sonnet 4.5: 66
#16 o1-preview: 65
#17 Kimi K2.5 (Reasoning): 65
#18 GLM-5 (Reasoning): 62
#20 Gemini 3 Pro: 59
#22 DeepSeek Coder 2.0: 51
#23 Claude 4 Sonnet: 51
#24 o3: 50
#25 Mistral Large 2: 49
#27 Grok 4: 48
#28 Claude 4.1 Opus: 48
#29 Claude Haiku 4.5: 48
#30 Qwen2.5-1M: 47
#31 o3-pro: 46
#32 GLM-5: 46
#33 Qwen2.5-72B: 46
#34 DeepSeek LLM 2.0: 46
#36 DeepSeek V3.2: 45
#37 Gemini 2.5 Pro: 45
#38 o4-mini (high): 45
#39 DeepSeekMath V2: 45
#40 MiMo-V2-Flash: 45
#41 Mistral Large 3: 45
#42 MiniMax M2.5: 45
#43 Gemini 3 Flash: 44
#45 GLM-4.7: 43
#46 Qwen3.5 397B: 42
#47 Kimi K2.5: 42
#49 GPT-5 mini: 41
#50 GLM-4.7-Flash: 40
#51 Claude 3.5 Sonnet: 36
#52 Moonshot v1: 34
#53 Z-1: 33
#55 Nemotron-4 15B: 31
#57 GPT-OSS 120B: 29
#58 Mistral 8x7B: 28
#60 Gemini 2.5 Flash: 23
#62 GPT-4o: 20
#65 Nova Pro: 19
#66 GLM-4.5: 18
#67 Claude 3 Haiku: 17
#68 DeepSeek-R1: 17
#69 Qwen2.5-VL-32B: 17
#70 Gemma 3 27B: 16
#72 MiniMax M1 80k: 16
#73 Mistral 8x7B v0.2: 16
#75 Qwen3 235B 2507: 15
#76 GLM-4.5-Air: 15
#77 Kimi K2: 15
#78 Mistral 7B v0.3: 15
#80 GPT-OSS 20B: 14
#82 DeepSeek V3.1: 13
#83 Llama 4 Scout: 12
#84 Claude 3 Opus: 10
#85 Llama 3 70B: 9
#86 Gemini 1.5 Pro: 5
#87 GPT-4 Turbo: 5
#88 Gemini 1.0 Pro: 5

FAQ

What does SWE-bench Verified measure?

SWE-bench Verified is a curated, human-verified subset of SWE-bench that measures a model's ability to resolve real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn. A task is scored as resolved only if the model's patch makes the issue's failing tests pass without breaking the existing test suite.

Which model scores highest on SWE-bench Verified?

GPT-5.3 Codex by OpenAI currently leads with a score of 85 on SWE-bench Verified.

How many models are evaluated on SWE-bench Verified?

88 AI models have been evaluated on SWE-bench Verified on BenchLM.