
Coding Benchmarks

Programming and software development

HumanEval · SWE-bench Verified · LiveCodeBench · SWE-bench Pro

Coding benchmarks evaluate whether an AI model can write, debug, and understand code at a professional level. Coding now carries a 20% weight in BenchLM.ai's scoring system, making it the second most influential category after agentic execution.

BenchLM.ai scores coding using three benchmarks: SWE-bench Pro and LiveCodeBench carry the most weight as the strongest frontier signals, while SWE-bench Verified is retained as a historical baseline. Legacy benchmarks such as HumanEval are still displayed for reference but no longer factor into the overall score, since frontier models have saturated them. A model that scores well on both SWE-bench Pro and LiveCodeBench is usually the safer choice for real coding-agent work.
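
Since the page does not publish its exact weighting, here is a minimal sketch of how such a composite coding score could be computed. The 0.4/0.4/0.2 split and the benchmark keys are illustrative assumptions, not BenchLM.ai's actual formula:

```python
# Minimal sketch of a weighted coding score, assuming a simple weighted
# average. The weights below are ILLUSTRATIVE GUESSES, not BenchLM.ai's
# published formula.

BENCHMARK_WEIGHTS = {
    "swe_bench_pro": 0.4,       # assumed: strongest frontier signal
    "livecodebench": 0.4,       # assumed: strongest frontier signal
    "swe_bench_verified": 0.2,  # assumed: historical baseline
    # HumanEval is displayed on the leaderboard but carries zero weight,
    # since frontier models have saturated it.
}

def coding_score(results: dict[str, float]) -> float:
    """Weighted average over the scored benchmarks (0-100 scale),
    renormalizing if a model is missing a benchmark result."""
    weighted = sum(results[b] * w for b, w in BENCHMARK_WEIGHTS.items() if b in results)
    coverage = sum(w for b, w in BENCHMARK_WEIGHTS.items() if b in results)
    return weighted / coverage if coverage else 0.0

# GPT-5.4 Pro's row from the leaderboard below:
print(round(coding_score({
    "swe_bench_pro": 89,
    "livecodebench": 86,
    "swe_bench_verified": 86,
}), 1))  # 87.2 under these assumed weights (the site's published score is 91)
```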

Data contamination is a particular concern in coding benchmarks: HumanEval's problems have been public since 2021. That is why LiveCodeBench, which continuously sources fresh problems, often shows wider score spreads and is considered the most trustworthy signal. See our coding rankings for the full leaderboard, or read our LiveCodeBench deep dive.
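
To make the contamination point concrete, here is a minimal sketch of a LiveCodeBench-style freshness filter: score a model only on problems released after its training-data cutoff, so solutions cannot have leaked into the training set. The `Problem` schema and the dates are hypothetical, not the benchmark's actual data model:

```python
# Sketch of a LiveCodeBench-style contamination filter. The schema and
# dates below are hypothetical illustrations.
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    release_date: date  # when the problem first went public

def fresh_problems(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

pool = [
    Problem("classic-two-sum", date(2021, 7, 1)),  # public since 2021: likely contaminated
    Problem("new-contest-q", date(2026, 1, 15)),   # post-cutoff: safe to score
]
safe = fresh_problems(pool, model_cutoff=date(2025, 6, 1))
print([p.problem_id for p in safe])  # ['new-contest-q']
```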

124 models tracked (showing 25 of 124). "Context" is the context window; "Score" is the overall coding score.

| Rank | Model | Organization | Access | Type | Context | Score | HumanEval | SWE-bench Verified | LiveCodeBench | SWE-bench Pro |
|------|-------|--------------|--------|------|---------|-------|-----------|--------------------|---------------|---------------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 95% | 86% | 86% | 89% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 93% | 83% | 81% | 89% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 84% | 84% | 85% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 85% | 85% | 90% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 91% | 80% | 79% | 85% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 88% | 76% | 75% | 83% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 91% | 80% | 80% | 85% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 91% | 80% | 75% | 74% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 87% | 75% | 74% | 77% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 76% | 66% | 86% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 91% | 75% | 71% | 72% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 75% | 67% | 84% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 91% | 77% | 73% | 73% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 91% | 58% | 58% | 63% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 89% | 68% | 61% | 71% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 85% | 67% | 62% | 70% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 93% | 69% | 54% | 64% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 88% | 62% | 58% | 67% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 83% | 67% | 60% | 72% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 91% | 68% | 57% | 62% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 91% | 59% | 49% | 58% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 86% | 65% | 60% | 69% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 87% | 66% | 53% | 60% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 86% | 68% | 54% | 63% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 84% | 65% | 58% | 70% |

About Coding Benchmarks

HumanEval: Python programming problems with test cases