Coding Benchmarks
Programming and software development
HumanEval · SWE-bench Verified · LiveCodeBench · FLTEval · SWE-bench Pro
Coding benchmarks evaluate whether an AI model can write, debug, and understand code at a professional level. Coding now carries a 20% weight in BenchLM.ai's scoring system, making it the second most influential category after agentic execution.
BenchLM.ai scores coding using three benchmarks: SWE-bench Pro and LiveCodeBench carry the most weight as the strongest frontier signals, while SWE-bench Verified remains a historical baseline. FLTEval is displayed as a specialized formal-verification and proof-engineering benchmark, but it is not yet weighted into the overall score because public coverage is still sparse. Legacy benchmarks like HumanEval are still shown for reference but no longer factor into the overall score, as frontier models have saturated them. A model that scores well on both SWE-bench Pro and LiveCodeBench is usually the safer choice for real coding-agent work.
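The scoring scheme above, three weighted benchmarks plus displayed-but-unweighted ones, can be sketched in a few lines. This is a minimal illustration, not BenchLM.ai's actual implementation; the specific weight values are assumptions chosen only to reflect the stated ordering (SWE-bench Pro and LiveCodeBench heaviest, SWE-bench Verified lightest, HumanEval and FLTEval excluded):

```python
# Hypothetical sketch of a category score: only weighted benchmarks
# contribute; missing scores renormalize the remaining weights.
# Weight values are illustrative assumptions, not published figures.
CODING_WEIGHTS = {
    "swe_bench_pro": 0.45,       # assumed: strongest frontier signal
    "livecodebench": 0.40,       # assumed: fresh, low-contamination signal
    "swe_bench_verified": 0.15,  # assumed: historical baseline
}

def coding_score(results: dict) -> float:
    """Weighted average over the weighted benchmarks a model has scores for.

    Unweighted benchmarks (e.g. HumanEval, FLTEval) are simply ignored,
    mirroring "displayed for reference but not factored in".
    """
    scored = {b: results[b] for b in CODING_WEIGHTS if results.get(b) is not None}
    total_w = sum(CODING_WEIGHTS[b] for b in scored)
    if total_w == 0:
        raise ValueError("no weighted benchmark scores available")
    return sum(CODING_WEIGHTS[b] * s for b, s in scored.items()) / total_w
```

Under these assumed weights, a model with SWE-bench Pro 89, LiveCodeBench 86, and SWE-bench Verified 86 lands in the high 80s, roughly matching the top of the table below; a HumanEval entry passed in alongside them would be ignored.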
Data contamination is a particular concern in coding benchmarks — HumanEval's problems have been public since 2021. That's why LiveCodeBench, which continuously sources fresh problems, often shows wider score spreads and is considered the most trustworthy mainstream coding signal. FLTEval adds a different lens by testing repository-style Lean 4 proof work where the verifier is formal rather than human. See our coding rankings for the full leaderboard, or read our LiveCodeBench deep dive.
Rank | Model | Provider | Access | Type | Context | Score | HumanEval | SWE-bench Verified | LiveCodeBench | FLTEval | SWE-bench Pro
---- | ----- | -------- | ------ | ---- | ------- | ----- | --------- | ------------------ | ------------- | ------- | -------------
1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 95% | 86% | 86% | — | 89%
2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 93% | 83% | 81% | — | 89%
3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 84% | 84% | — | 85%
4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 85% | 85% | — | 90%
5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 91% | 80% | 79% | — | 85%
6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 88% | 76% | 75% | — | 83%
7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 91% | 80% | 80% | — | 85%
8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 91% | 80% | 75% | 39.6% | 74%
9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 87% | 75% | 74% | — | 77%
10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 76% | 66% | — | 86%
11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 91% | 75% | 71% | — | 72%
12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 75% | 67% | — | 84%
13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 91% | 77% | 73% | — | 73%
14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 91% | 58% | 58% | — | 63%
15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 89% | 68% | 61% | — | 71%
16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 85% | 67% | 62% | — | 70%
17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 93% | 69% | 54% | 23.7% | 64%
18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 88% | 62% | 58% | — | 67%
19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 83% | 67% | 60% | — | 72%
20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 91% | 68% | 57% | — | 62%
21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 91% | 59% | 49% | — | 58%
22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 86% | 65% | 60% | — | 69%
23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 87% | 66% | 53% | — | 60%
24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 86% | 68% | 54% | — | 63%
25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 84% | 65% | 58% | — | 70%
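The contamination point made earlier is visible in the table itself: HumanEval scores cluster near saturation while LiveCodeBench scores spread widely. A quick sketch, using a handful of (HumanEval, LiveCodeBench) pairs copied from the rows above, makes the difference in spread concrete:

```python
from statistics import pstdev

# (HumanEval %, LiveCodeBench %) pairs taken from the leaderboard rows above.
rows = {
    "GPT-5.4 Pro":       (95, 86),
    "Claude Opus 4.6":   (91, 75),
    "Claude Sonnet 4.6": (93, 54),
    "Gemini 3 Pro":      (91, 49),
    "Kimi K2.5":         (84, 58),
}

humaneval = [h for h, _ in rows.values()]
livecode = [l for _, l in rows.values()]

# A saturated benchmark compresses: its scores vary far less across models.
print(f"HumanEval spread (pstdev):     {pstdev(humaneval):.1f}")
print(f"LiveCodeBench spread (pstdev): {pstdev(livecode):.1f}")
```

On this sample the LiveCodeBench standard deviation is several times the HumanEval one, which is why the fresher benchmark is the better signal for separating frontier models.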
About Coding Benchmarks
HumanEval: Python programming problems with test cases