
Coding Benchmarks

Programming and software development

HumanEval · SWE-bench Verified · LiveCodeBench · SWE-bench Pro

Coding benchmarks evaluate whether an AI model can write, debug, and understand code at a professional level. Coding now carries a 20% weight in BenchLM.ai's scoring system, making it the second most influential category after agentic execution.

BenchLM.ai scores coding using three benchmarks: SWE-bench Pro and LiveCodeBench carry the most weight as the strongest frontier signals, while SWE-bench Verified is retained as a historical baseline. Legacy benchmarks such as HumanEval are still displayed for reference but no longer factor into the overall score, since frontier models have saturated them. A model that scores well on both SWE-bench Pro and LiveCodeBench is usually the safer choice for real coding-agent work.
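
Since the page does not publish its exact weighting, here is a minimal sketch of how such a composite coding score could be computed. The 0.4/0.4/0.2 split and the benchmark keys are illustrative assumptions, not BenchLM.ai's actual formula:

```python
# Minimal sketch of a weighted coding score, assuming a simple weighted
# average. The weights below are ILLUSTRATIVE GUESSES, not BenchLM.ai's
# published formula.

BENCHMARK_WEIGHTS = {
    "swe_bench_pro": 0.4,       # assumed: strongest frontier signal
    "livecodebench": 0.4,       # assumed: strongest frontier signal
    "swe_bench_verified": 0.2,  # assumed: historical baseline
    # HumanEval is displayed on the leaderboard but carries zero weight,
    # since frontier models have saturated it.
}

def coding_score(results: dict[str, float]) -> float:
    """Weighted average over the scored benchmarks (0-100 scale),
    renormalizing if a model is missing a benchmark result."""
    weighted = sum(results[b] * w for b, w in BENCHMARK_WEIGHTS.items() if b in results)
    coverage = sum(w for b, w in BENCHMARK_WEIGHTS.items() if b in results)
    return weighted / coverage if coverage else 0.0

# GPT-5.4 Pro's row from the leaderboard below:
print(round(coding_score({
    "swe_bench_pro": 89,
    "livecodebench": 86,
    "swe_bench_verified": 86,
}), 1))  # 87.2 under these assumed weights (the site's published score is 91)
```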

Data contamination is a particular concern in coding benchmarks: HumanEval's problems have been public since 2021. That is why LiveCodeBench, which continuously sources fresh problems, often shows wider score spreads and is considered the most trustworthy signal. See our coding rankings for the full leaderboard, or read our LiveCodeBench deep dive.
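
To make the contamination point concrete, here is a minimal sketch of a LiveCodeBench-style freshness filter: score a model only on problems released after its training-data cutoff, so solutions cannot have leaked into the training set. The `Problem` schema and the dates are hypothetical, not the benchmark's actual data model:

```python
# Sketch of a LiveCodeBench-style contamination filter. The schema and
# dates below are hypothetical illustrations.
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    release_date: date  # when the problem first went public

def fresh_problems(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

pool = [
    Problem("classic-two-sum", date(2021, 7, 1)),  # public since 2021: likely contaminated
    Problem("new-contest-q", date(2026, 1, 15)),   # post-cutoff: safe to score
]
safe = fresh_problems(pool, model_cutoff=date(2025, 6, 1))
print([p.problem_id for p in safe])  # ['new-contest-q']
```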

124 models tracked (showing 25 of 124). "Context" is the context window; "Score" is the overall coding score.

| Rank | Model | Organization | Access | Type | Context | Score | HumanEval | SWE-bench Verified | LiveCodeBench | SWE-bench Pro |
|------|-------|--------------|--------|------|---------|-------|-----------|--------------------|---------------|---------------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 95% | 86% | 86% | 89% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 93% | 83% | 81% | 89% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 84% | 84% | 85% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 85% | 85% | 90% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 91% | 80% | 79% | 85% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 88% | 76% | 75% | 83% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 91% | 80% | 80% | 85% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 91% | 80% | 75% | 74% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 87% | 75% | 74% | 77% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 76% | 66% | 86% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 91% | 75% | 71% | 72% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 75% | 67% | 84% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 91% | 77% | 73% | 73% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 91% | 58% | 58% | 63% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 89% | 68% | 61% | 71% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 85% | 67% | 62% | 70% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 93% | 69% | 54% | 64% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 88% | 62% | 58% | 67% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 83% | 67% | 60% | 72% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 91% | 68% | 57% | 62% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 91% | 59% | 49% | 58% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 86% | 65% | 60% | 69% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 87% | 66% | 53% | 60% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 86% | 68% | 54% | 63% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 84% | 65% | 58% | 70% |

About Coding Benchmarks

HumanEval: Python programming problems with test cases