AI Coding Benchmarks — SWE-bench & LiveCodeBench Leaderboard
Programming and software development
Bottom line: Claude Mythos Preview dominates SWE-bench Pro, but GPT-5.3 Codex is the strongest open-weight alternative for cost-sensitive teams.
HumanEval · SWE-bench Verified · LiveCodeBench · LiveCodeBench Pro · FLTEval · SWE-bench Pro · SWE-Rebench · SWE Multilingual · Multi-SWE Bench · VIBE-Pro · NL2Repo · Vibe Code Bench · React Native Evals · SWE-bench Verified*
Best Coding picks
BenchLM summaries for coding plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
| Pick | Provider | Metric | Value |
|---|---|---|---|
| Claude Mythos Preview | Anthropic | Category score | 100 |
| DeepSeek V4 Pro (Max) | DeepSeek | Overall score | 88 |
| Qwen3.6-27B | Alibaba | Avg price / 1M tokens | $0.00 |
| Mercury 2 | Inception | Tokens / sec | 789 |
| LFM2-24B-A2B | LiquidAI | TTFT | 0.42s |
| Nemotron 3 Ultra 500B | NVIDIA | Context window | 10M |
Top AI Models for Coding — April 2026
As of April 2026, Claude Mythos Preview leads the provisional coding leaderboard with a weighted score of 100.0%, followed by Claude Opus 4.7 (Adaptive) (95.2%) and Gemini 3.1 Pro (93.5%). BenchLM is currently showing 95 provisional-ranked models and 15 verified-ranked models in this category.
1. Claude Mythos Preview, Anthropic: Highest SWE-bench Pro score ever. Premium-priced but unmatched on real-world SE tasks.
2. Claude Opus 4.7 (Adaptive), Anthropic
3. Gemini 3.1 Pro, Google
What changed
- GPT-5.3 Codex jumped to #2 on SWE-bench Pro with a 77.3 score, the highest open-weight coding result ever.
- Claude Mythos Preview entered as the new coding leader with a perfect 100.0 weighted score.
- GPT-5.4 is now #3, overtaking Claude Opus 4.6 on LiveCodeBench.
Top models by benchmark
SWE-bench Verified: real-world GitHub issues from popular Python repos, human-verified subset (13% of category score).
SWE-bench Pro & LiveCodeBench Leaderboard
Updated April 29, 2026. Sorted by coding weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | Weighted | Overall | HumanEval | SWE-bench Verified | LiveCodeBench | LiveCodeBench Pro | FLTEval | SWE-bench Pro | SWE-Rebench | SWE Multilingual | Multi-SWE Bench | VIBE-Pro | NL2Repo | Vibe Code Bench | React Native Evals | SWE-bench Verified* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 100% | 99 | — | 93.9% | — | — | — | 77.8% | — | — | — | — | — | — | — | — |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | 95.2% | 90 | — | 87.6% | — | — | — | 64.3% | — | — | — | — | — | — | — | — |
| 3 | Gemini 3.1 Pro | Google | 93.5% | 92 | — | — | — | 82.9% | — | — | — | — | — | — | — | 32.03% | 78.9% | — |
| 4 | — | — | 90.4% | 88 | — | 80.6% | 93.5% | — | — | 55.4% | — | 76.2% | — | — | — | 49.93% | — | — |
| 5 | GPT-5.4 | OpenAI | 89.3% | 89 | — | — | — | 87.5% | — | 57.7% | — | — | — | — | — | 67.42% | 85.3% | — |
| 6 | GPT-5.3 Codex | OpenAI | 88.7% | Est. 87 | — | 85% | — | — | — | 56.8% | 58.2% | — | — | — | — | 61.77% | — | — |
| 7 | — | — | 88.7% | 84 | — | 80.2% | 89.6% | — | — | 58.6% | — | 76.7% | — | — | — | 37.89% | — | — |
| 8 | — | — | 87.1% | 84 | — | 79.4% | 89.8% | — | — | 54.4% | — | 74.1% | — | — | — | — | — | — |
| 9 | Claude Opus 4.6 | Anthropic | 86.9% | 87 | — | 80.8% | — | 70.7% | — | 53.4% | 65.3% | — | — | — | — | 57.57% | 84.1% | 75.6% |
| 10 | — | — | 86.8% | 76 | — | 76.8% | — | — | — | — | — | — | — | — | — | 17.54% | — | — |
| 11 | — | — | 83.9% | 71 | — | 78.6% | 88.4% | — | — | 52.3% | — | 70.2% | — | — | — | — | — | — |
| 12 | — | — | 83.5% | 83 | — | — | — | — | — | 58.4% | 62.7% | — | — | — | 42.7% | 31.46% | — | — |
| 13 | Grok 4.1 | xAI | 83.4% | Est. 90 | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 14 | — | — | 83.3% | 76 | — | 79% | 91.6% | — | — | 52.6% | — | 73.3% | — | — | — | — | — | — |
| 15 | Claude Sonnet 4.6 | Anthropic | 83.2% | 83 | — | 79.6% | — | — | — | — | 60.7% | — | — | — | — | 51.48% | 80.6% | — |
| 16 | — | — | 82.6% | 64 | — | 76.8% | 85% | — | — | 50.7% | 58.5% | 73% | — | — | — | — | 77.2% | 70.8% |
| 17 | GPT-5.2 | OpenAI | 82.5% | 81 | — | 80% | — | — | — | 55.6% | — | — | — | — | — | 53.50% | — | — |
| 18 | o1-preview | OpenAI | 80.5% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 19 | — | — | 80% | 74 | — | 77.2% | 83.9% | — | — | 53.5% | — | 71.3% | — | — | 36.2% | — | — | — |
| 20 | GPT-5 (medium) | OpenAI | 79.5% | Est. 72 | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 21 | GPT-5.2-Codex | OpenAI | 79.4% | Est. 78 | — | — | — | — | — | — | — | — | — | — | — | 37.91% | — | — |
| 22 | Claude Sonnet 4.5 | Anthropic | 79.3% | Est. 66 | — | 77.2% | — | — | — | — | — | — | — | — | — | — | — | — |
| 23 | Grok 4 | xAI | 78.8% | Est. 65 | — | — | — | — | — | — | — | — | — | — | — | — | 72.6% | — |
| 24 | GPT-5.1 | OpenAI | 78.6% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | 24.61% | — | — |
| 25 | Qwen3.6 Plus | Alibaba | 77.6% | 74 | — | 78.8% | — | — | — | 56.6% | — | 73.8% | — | — | — | 25.56% | — | — |
These rankings update weekly.
Score in Context
What these scores mean
Coding carries a 20% weight in BenchLM.ai's overall scoring. The weighted score blends SWE-bench Pro (real GitHub issues) and LiveCodeBench (competitive programming) equally. A 5-point gap is meaningful — it typically separates a model that can solve a complex multi-file bug from one that gets stuck.
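As a sketch of that blend (the function below is hypothetical; BenchLM's exact normalization is described on the methodology page), the category score reduces to an equally weighted average:

```python
# Hypothetical sketch of the 50/50 coding category blend described above.
# Assumes both scores are already on comparable percent scales after normalization.
def coding_weighted_score(swe_bench_pro: float, live_code_bench: float) -> float:
    return 0.5 * swe_bench_pro + 0.5 * live_code_bench

# A 5-point category gap corresponds to a 10-point swing on one benchmark:
print(coding_weighted_score(77.8, 90.0))  # 83.9
print(coding_weighted_score(67.8, 90.0))  # 78.9
```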
Known limitations
HumanEval is saturated — frontier models all score 95%+, so it no longer differentiates. SWE-bench Verified is shown for reference but superseded by the harder Pro variant. LiveCodeBench is the most contamination-resistant coding signal because it continuously sources fresh problems.
How we weight
At 20%, coding is the second most influential category in BenchLM.ai's overall scoring, after agentic execution.
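In overall-score terms, that 20% works out as below (a sketch; `other_categories_score` is a hypothetical stand-in for the blended 80% contributed by the remaining categories):

```python
# Hypothetical sketch of coding's 20% contribution to the overall score.
# The 0.80 remainder stands in for BenchLM's other category weights combined.
def overall_score(coding_score: float, other_categories_score: float) -> float:
    return 0.20 * coding_score + 0.80 * other_categories_score

print(overall_score(coding_score=100.0, other_categories_score=85.0))  # 88.0
```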
Data contamination is a particular concern — HumanEval's problems have been public since 2021. LiveCodeBench continuously sources fresh problems, making it the most trustworthy mainstream coding signal. BenchLM also tracks React Native Evals as a display benchmark for framework-specific mobile app work. See the full coding leaderboard or compare model pricing.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
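A minimal sketch of that filter-then-fallback rule, assuming hypothetical field names (`source`, `score`) and the 50/50 weights from the table below; this is an illustration, not BenchLM's actual pipeline:

```python
# Hypothetical sketch of the exclusion filter and missing-benchmark fallback.
WEIGHTS = {"SWE-bench Pro": 0.5, "LiveCodeBench": 0.5}

def coding_category_score(rows: dict[str, dict]) -> float | None:
    # Exclude rows generated from other scores or cloned from reference models.
    trusted = {
        bench: row
        for bench, row in rows.items()
        if row["source"] not in ("generated", "cloned")
    }
    # Renormalize weights over the trustworthy weighted rows that remain,
    # rather than imputing synthetic values for the missing benchmark.
    present = {bench: w for bench, w in WEIGHTS.items() if bench in trusted}
    total_weight = sum(present.values())
    if total_weight == 0:
        return None  # no trustworthy weighted rows at all
    return sum(trusted[b]["score"] * w for b, w in present.items()) / total_weight

# Example: with LiveCodeBench filtered out, the category falls back to SWE-bench Pro alone.
rows = {
    "SWE-bench Pro": {"score": 77.8, "source": "public"},
    "LiveCodeBench": {"score": 90.0, "source": "cloned"},
}
print(coding_category_score(rows))  # 77.8
```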
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| SWE-bench Pro | 50% | Weighted | Frontier real-world SE tasks |
| LiveCodeBench | 50% | Weighted | Contamination-free competitive programming |
| SWE-Rebench | — | Display only | Fresh rolling-window GitHub issues |
| SWE-bench Verified | — | Display only | Historical baseline, superseded by Pro |
| FLTEval | — | Display only | Lean 4 proof engineering, sparse coverage |
| React Native Evals | — | Display only | Framework-specific mobile app engineering |
| HumanEval | — | Display only | Saturated by frontier models |
About Coding Benchmarks
HumanEval: Python programming problems with test cases.
Related
Best LLMs Overall: Top models ranked across all benchmark categories.
Best Open-Weight Models: Top open-source models for code generation and debugging.
Agentic Benchmarks: How models perform on autonomous coding agent tasks.
AI Cost Calculator: Compare pricing across models for coding workloads.