Coding Benchmarks
Programming and software development
HumanEval · SWE-bench Verified · LiveCodeBench · FLTEval · SWE-bench Pro
Coding benchmarks evaluate whether an AI model can write, debug, and understand code at a professional level. Coding now carries a 20% weight in BenchLM.ai's scoring system, making it the second most influential category after agentic execution.
BenchLM.ai scores coding using three benchmarks: SWE-bench Pro and LiveCodeBench carry the most weight as the strongest frontier signals, while SWE-bench Verified remains a historical baseline. FLTEval is displayed as a specialized formal-verification and proof-engineering benchmark, but it is not yet weighted into the overall score because public coverage is still sparse. Legacy benchmarks like HumanEval are still shown for reference but no longer factor into the overall score, as frontier models have saturated them. A model that scores well on both SWE-bench Pro and LiveCodeBench is usually the safer choice for real coding-agent work.
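The scoring scheme above, three weighted benchmarks plus displayed-but-unweighted ones, can be sketched in a few lines. This is a minimal illustration, not BenchLM.ai's actual implementation; the specific weight values are assumptions chosen only to reflect the stated ordering (SWE-bench Pro and LiveCodeBench heaviest, SWE-bench Verified lightest, HumanEval and FLTEval excluded):

```python
# Hypothetical sketch of a category score: only weighted benchmarks
# contribute; missing scores renormalize the remaining weights.
# Weight values are illustrative assumptions, not published figures.
CODING_WEIGHTS = {
    "swe_bench_pro": 0.45,       # assumed: strongest frontier signal
    "livecodebench": 0.40,       # assumed: fresh, low-contamination signal
    "swe_bench_verified": 0.15,  # assumed: historical baseline
}

def coding_score(results: dict) -> float:
    """Weighted average over the weighted benchmarks a model has scores for.

    Unweighted benchmarks (e.g. HumanEval, FLTEval) are simply ignored,
    mirroring "displayed for reference but not factored in".
    """
    scored = {b: results[b] for b in CODING_WEIGHTS if results.get(b) is not None}
    total_w = sum(CODING_WEIGHTS[b] for b in scored)
    if total_w == 0:
        raise ValueError("no weighted benchmark scores available")
    return sum(CODING_WEIGHTS[b] * s for b, s in scored.items()) / total_w
```

Under these assumed weights, a model with SWE-bench Pro 89, LiveCodeBench 86, and SWE-bench Verified 86 lands in the high 80s, roughly matching the top of the table below; a HumanEval entry passed in alongside them would be ignored.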
Data contamination is a particular concern in coding benchmarks — HumanEval's problems have been public since 2021. That's why LiveCodeBench, which continuously sources fresh problems, often shows wider score spreads and is considered the most trustworthy mainstream coding signal. FLTEval adds a different lens by testing repository-style Lean 4 proof work where the verifier is formal rather than human. See our coding rankings for the full leaderboard, or read our LiveCodeBench deep dive.
Rank | Model | Provider | Access | Type | Context | Score | HumanEval | SWE-bench Verified | LiveCodeBench | FLTEval | SWE-bench Pro
---- | ----- | -------- | ------ | ---- | ------- | ----- | --------- | ------------------ | ------------- | ------- | -------------
1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 95% | 86% | 86% | — | 89%
2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 93% | 83% | 81% | — | 89%
3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 95% | 84% | 84% | — | 85%
4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 95% | 85% | 85% | — | 90%
5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 91% | 80% | 79% | — | 85%
6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 88% | 76% | 75% | — | 83%
7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 91% | 80% | 80% | — | 85%
8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 91% | 80% | 75% | 39.6% | 74%
9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 87% | 75% | 74% | — | 77%
10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 95% | 76% | 66% | — | 86%
11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 91% | 75% | 71% | — | 72%
12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 94% | 75% | 67% | — | 84%
13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 91% | 77% | 73% | — | 73%
14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 91% | 58% | 58% | — | 63%
15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 89% | 68% | 61% | — | 71%
16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 85% | 67% | 62% | — | 70%
17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 93% | 69% | 54% | 23.7% | 64%
18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 88% | 62% | 58% | — | 67%
19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 83% | 67% | 60% | — | 72%
20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 91% | 68% | 57% | — | 62%
21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 91% | 59% | 49% | — | 58%
22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 86% | 65% | 60% | — | 69%
23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 87% | 66% | 53% | — | 60%
24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 86% | 68% | 54% | — | 63%
25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 84% | 65% | 58% | — | 70%
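The contamination point made earlier is visible in the table itself: HumanEval scores cluster near saturation while LiveCodeBench scores spread widely. A quick sketch, using a handful of (HumanEval, LiveCodeBench) pairs copied from the rows above, makes the difference in spread concrete:

```python
from statistics import pstdev

# (HumanEval %, LiveCodeBench %) pairs taken from the leaderboard rows above.
rows = {
    "GPT-5.4 Pro":       (95, 86),
    "Claude Opus 4.6":   (91, 75),
    "Claude Sonnet 4.6": (93, 54),
    "Gemini 3 Pro":      (91, 49),
    "Kimi K2.5":         (84, 58),
}

humaneval = [h for h, _ in rows.values()]
livecode = [l for _, l in rows.values()]

# A saturated benchmark compresses: its scores vary far less across models.
print(f"HumanEval spread (pstdev):     {pstdev(humaneval):.1f}")
print(f"LiveCodeBench spread (pstdev): {pstdev(livecode):.1f}")
```

On this sample the LiveCodeBench standard deviation is several times the HumanEval one, which is why the fresher benchmark is the better signal for separating frontier models.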
About Coding Benchmarks
HumanEval: Python programming problems with test cases