
AI Coding Benchmarks — SWE-bench & LiveCodeBench Leaderboard

Programming and software development

Bottom line: Claude Mythos Preview dominates SWE-bench Pro, but GPT-5.3 Codex is the strongest open-weight alternative for cost-sensitive teams.

HumanEval · SWE-bench Verified · LiveCodeBench · LiveCodeBench Pro · FLTEval · SWE-bench Pro · SWE-Rebench · SWE Multilingual · Multi-SWE Bench · VIBE-Pro · NL2Repo · Vibe Code Bench · React Native Evals

Best Coding picks

BenchLM summaries for coding, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Coding (April 2026)

As of April 2026, Claude Mythos Preview leads the provisional coding leaderboard with a weighted score of 100.0%, followed by Claude Opus 4.7 (Adaptive) (95.2%) and Gemini 3.1 Pro (93.5%). BenchLM is currently showing 95 provisional-ranked models and 15 verified-ranked models in this category.

What changed

GPT-5.3 Codex jumped to #2 on SWE-bench Pro with a 77.3% score, the highest open-weight coding result to date.

Claude Mythos Preview entered as the new coding leader with a perfect 100.0 weighted score.

GPT-5.4 is now #3 on LiveCodeBench, overtaking Claude Opus 4.6.


Top models by benchmark

Real-world GitHub issues from popular Python repos, human-verified subset (13% of category score)

SWE-bench Pro & LiveCodeBench Leaderboard

Updated April 29, 2026

Sorted by coding weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

95 ranked models
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence (a filtering sketch follows the table). P = provisional benchmark row.
[Leaderboard table, showing 25 of 95 models; per-benchmark columns omitted. Identifiable rows by coding weighted score: #1 Claude Mythos Preview (100.0%), #2 Claude Opus 4.7 (Adaptive) (95.2%), #3 Gemini 3.1 Pro (93.5%), #5 GPT-5.4, OpenAI (89.3%), #17 GPT-5.2, OpenAI (82.5%), #24 GPT-5.1, OpenAI (78.6%).]
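
A minimal sketch of the provisional/verified row filtering described above. Field names such as source_verified and generated are assumptions for illustration, not BenchLM's actual schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    model: str
    benchmark: str
    score: float
    source_verified: bool  # score traceable to a published source
    generated: bool        # derived by BenchLM from other scores

def rows_for_mode(rows: list[BenchmarkRow], mode: str) -> list[BenchmarkRow]:
    """Generated rows are always dropped; verified mode also drops
    rows without a traceable source."""
    kept = [r for r in rows if not r.generated]
    if mode == "verified":
        kept = [r for r in kept if r.source_verified]
    return kept
```

Under this reading, provisional mode ranks the broader public dataset while verified mode restricts ranking to sourced rows, which is consistent with the 95 versus 15 model counts quoted above.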

These rankings update weekly.

Score in Context

What these scores mean

Coding carries a 20% weight in BenchLM.ai's overall scoring. The weighted score blends SWE-bench Pro (real GitHub issues) and LiveCodeBench (competitive programming) equally. A 5-point gap is meaningful — it typically separates a model that can solve a complex multi-file bug from one that gets stuck.
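
For concreteness, a minimal sketch of that 50/50 blend; the function name and example scores are illustrative, not BenchLM's actual pipeline:

```python
def coding_weighted_score(swe_bench_pro: float, live_code_bench: float) -> float:
    """Equal 50/50 blend of the two weighted coding benchmarks."""
    return 0.5 * swe_bench_pro + 0.5 * live_code_bench

# A hypothetical model at 77.3% on SWE-bench Pro and 85.1% on LiveCodeBench:
print(coding_weighted_score(77.3, 85.1))  # ≈ 81.2
```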

Known limitations

HumanEval is saturated — frontier models all score 95%+, so it no longer differentiates. SWE-bench Verified is shown for reference but superseded by the harder Pro variant. LiveCodeBench is the most contamination-resistant coding signal because it continuously sources fresh problems.

How we weight

At 20% of the overall score, coding is the second most influential category after agentic execution.

Data contamination is a particular concern — HumanEval's problems have been public since 2021. LiveCodeBench continuously sources fresh problems, making it the most trustworthy mainstream coding signal. BenchLM also tracks React Native Evals as a display benchmark for framework-specific mobile app work. See the full coding leaderboard or compare model pricing.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
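
A minimal sketch of that fallback, assuming per-benchmark weights and a dict of surviving trustworthy rows (names and structure are assumptions):

```python
WEIGHTS = {"SWE-bench Pro": 0.5, "LiveCodeBench": 0.5}

def category_score(trusted_rows: dict[str, float]) -> float | None:
    """Score the coding category from trustworthy rows only.

    Generated or cloned rows are assumed to be filtered out before this
    call. When a weighted benchmark is missing, the remaining weights
    are renormalized rather than imputing a synthetic value.
    """
    present = {b: w for b, w in WEIGHTS.items() if b in trusted_rows}
    if not present:
        return None  # no trustworthy public rows survive the filter
    total_weight = sum(present.values())
    return sum(trusted_rows[b] * w / total_weight for b, w in present.items())

# A model with only a LiveCodeBench row falls back to that benchmark alone:
print(category_score({"LiveCodeBench": 85.0}))  # 85.0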

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
SWE-bench Pro | 50% | Weighted | Frontier real-world SE tasks
LiveCodeBench | 50% | Weighted | Contamination-free competitive programming
SWE-Rebench | n/a | Display only | Fresh rolling-window GitHub issues
SWE-bench Verified | n/a | Display only | Historical baseline, superseded by Pro
FLTEval | n/a | Display only | Lean 4 proof engineering, sparse coverage
React Native Evals | n/a | Display only | Framework-specific mobile app engineering
HumanEval | n/a | Display only | Saturated by frontier models


About Coding Benchmarks

HumanEval: Python programming problems with test cases.
