SWE-bench Pro

A more demanding coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

According to BenchLM.ai, GPT-5.3 Codex leads the SWE-bench Pro benchmark with a score of 90, followed by GPT-5.4 Pro (89) and GPT-5.2 Pro (89). The top three models are clustered within one point, suggesting this benchmark is nearing saturation for frontier models.

121 models have been evaluated on SWE-bench Pro. The benchmark falls in the coding category, which carries a 17% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
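As an illustration of how a 17% category weight could feed into an overall score, here is a minimal sketch under the assumption of a simple weighted average; BenchLM.ai's actual aggregation method is not documented here, so the function and the example figures are purely illustrative.

```python
def weighted_contribution(benchmark_score: float, category_weight: float) -> float:
    """Points one category contributes to a 0-100 overall score,
    assuming a plain weighted average (an assumption, not
    BenchLM.ai's documented formula)."""
    return benchmark_score * category_weight

# With coding weighted at 17%, a SWE-bench Pro score of 90
# contributes about 15.3 points to the overall score.
print(weighted_contribution(90, 0.17))
```

Under this assumption, a 10-point gain on SWE-bench Pro moves a model's overall score by only 1.7 points, which is why a single benchmark rarely reorders the overall leaderboard on its own.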

About SWE-bench Pro

Year: 2026
Tasks: Real-world software engineering
Format: Repository task completion
Difficulty: Frontier coding agent

SWE-bench Pro is the more relevant frontier signal for selecting coding agents in 2026, reflecting more realistic difficulty than the older SWE-bench Verified subset.

Why we no longer evaluate SWE-bench Verified

Leaderboard (121 models)

#1 GPT-5.3 Codex: 90
#2 GPT-5.4 Pro: 89
#3 GPT-5.2 Pro: 89
#4 GPT-5.2-Codex: 86
#5 GPT-5.4: 85
#6 GPT-5.2: 85
#9 GPT-5.3 Instant: 83
#10 GPT-5.2 Instant: 77
#11 Claude Opus 4.6: 74
#12 Grok 4.1: 73
#13 Gemini 3.1 Pro: 72
#14 GPT-5 (medium): 72
#15 GPT-5.1: 71
#16 GPT-5 (high): 70
#17 Kimi K2.5 (Reasoning): 70
#18 o1-preview: 69
#19 GLM-5 (Reasoning): 67
#21 GPT-4o mini: 65
#22 Claude Sonnet 4.6: 64
#25 Claude Opus 4.5: 62
#26 DeepSeek Coder 2.0: 61
#27 Claude Sonnet 4.5: 60
#28 Gemini 3 Pro: 58
#30 o3: 58
#31 o3-mini: 57
#32 o3-pro: 55
#33 Phi-4: 55
#34 MiMo-V2-Flash: 52
#35 GLM-4.7: 51
#36 DeepSeekMath V2: 51
#37 GPT-4.1: 51
#38 o1: 50
#39 GPT-5 mini: 49
#40 Qwen2.5-1M: 49
#41 Step 3.5 Flash: 49
#42 Grok 4: 48
#43 Claude 4 Sonnet: 48
#44 GLM-4.7-Flash: 48
#45 DBRX Instruct: 48
#46 Claude 4.1 Opus: 47
#47 DeepSeek V3.2: 47
#49 Qwen2.5-72B: 47
#50 GLM-5: 46
#51 Seed 1.6: 46
#52 Claude Haiku 4.5: 46
#53 DeepSeek LLM 2.0: 46
#55 Seed-2.0-Lite: 45
#56 Gemini 2.5 Pro: 44
#57 Gemini 3 Flash: 44
#58 Mistral Large 2: 44
#60 Mercury 2: 43
#62 o4-mini (high): 42
#63 Qwen3.5 397B: 42
#64 Mistral Large 3: 42
#66 MiniMax M2.5: 41
#67 Kimi K2.5: 40
#70 Claude 3.5 Sonnet: 37
#71 Aion-2.0: 37
#73 Ministral 3 14B: 34
#74 Seed 1.6 Flash: 31
#75 GPT-OSS 120B: 31
#76 GPT-4.1 mini: 30
#77 Moonshot v1: 30
#78 Nemotron-4 15B: 30
#79 Z-1: 30
#80 GPT-4o: 29
#82 Seed-2.0-Mini: 29
#85 Mistral 8x7B: 28
#87 Gemini 2.5 Flash: 25
#88 DeepSeek-R1: 25
#90 o1-pro: 23
#91 GPT-5 nano: 22
#93 Claude 3 Opus: 20
#94 Nova Pro: 20
#95 Claude 3 Haiku: 19
#96 LFM2-24B-A2B: 19
#97 Qwen3 235B 2507: 19
#98 Gemini 1.5 Pro: 18
#99 GPT-4.1 nano: 18
#100 GPT-OSS 20B: 18
#102 Gemma 3 27B: 17
#103 Qwen2.5-VL-32B: 17
#104 MiniMax M1 80k: 17
#105 Llama 4 Scout: 15
#107 GLM-4.5: 15
#108 DeepSeek V3.1: 15
#109 GPT-4 Turbo: 14
#110 Llama 3 70B: 14
#112 GLM-4.5-Air: 14
#113 Mistral 8x7B v0.2: 14
#114 Kimi K2: 13
#115 Ministral 3 8B: 13
#116 Gemini 1.0 Pro: 12
#117 Mistral 7B v0.3: 12
#121 Ministral 3 3B: 5

FAQ

What does SWE-bench Pro measure?

SWE-bench Pro measures a model's ability to complete realistic, repository-level software engineering tasks. It is a harder successor to SWE-bench Verified, designed to differentiate frontier models.

Which model scores highest on SWE-bench Pro?

GPT-5.3 Codex by OpenAI currently leads with a score of 90 on SWE-bench Pro.

How many models are evaluated on SWE-bench Pro?

121 AI models have been evaluated on SWE-bench Pro on BenchLM.

Last updated: March 12, 2026
