We ranked every major LLM by coding benchmarks — HumanEval, SWE-bench Verified, and LiveCodeBench. Here's which model actually comes out on top, and why the answer depends on what you're building.
GPT-5.3 Codex tops our coding leaderboard with an 88.3 average across HumanEval, SWE-bench Verified, and LiveCodeBench. But it's a specialized coding model, not a general-purpose one. If you need a model that also writes docs, answers questions, and handles other tasks, the picture gets more interesting.
We averaged the three coding benchmarks for every model we track. Here are the top 10.
| Rank | Model | Type | HumanEval | SWE-bench | LiveCodeBench | Avg |
|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex | Reasoning | 95 | 85 | 85 | 88.3 |
| 2 | GPT-5.2 | Reasoning | 91 | 80 | 79 | 83.3 |
| 3 | GPT-5.4 | Reasoning | 91 | 81 | 75 | 82.3 |
| 4 | Claude Opus 4.6 | Non-Reasoning | 91 | 80 | 75 | 82.0 |
| 5 | Grok 4.1 | Non-Reasoning | 91 | 77 | 73 | 80.3 |
| 6 | Gemini 3.1 Pro | Non-Reasoning | 91 | 75 | 71 | 79.0 |
| 7 | GPT-5.2-Codex | Reasoning | 95 | 76 | 66 | 79.0 |
| 8 | GPT-5.1-Codex-Max | Reasoning | 94 | 75 | 67 | 78.7 |
| 9 | GPT-5.1 | Reasoning | 89 | 68 | 61 | 72.7 |
| 10 | Claude Sonnet 4.6 | Non-Reasoning | 93 | 69 | 54 | 72.0 |
Full rankings with filters: Best LLMs for Coding.
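The ranking method itself is nothing exotic: an unweighted mean of the three scores, sorted descending. A minimal sketch, with the top four entries hard-coded from the table above:

```python
# Unweighted mean of the three coding benchmarks, as used in the table above.
# Scores are (HumanEval, SWE-bench Verified, LiveCodeBench), copied from the table.
scores = {
    "GPT-5.3 Codex": (95, 85, 85),
    "GPT-5.2": (91, 80, 79),
    "GPT-5.4": (91, 81, 75),
    "Claude Opus 4.6": (91, 80, 75),
}

# Average each model's three scores, then sort highest-first.
ranked = sorted(
    ((name, round(sum(s) / 3, 1)) for name, s in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, avg in ranked:
    print(f"{name}: {avg}")
# GPT-5.3 Codex: 88.3
# GPT-5.2: 83.3
# GPT-5.4: 82.3
# Claude Opus 4.6: 82.0
```

An unweighted mean means HumanEval counts as much as SWE-bench, even though (as argued below) it discriminates far less between frontier models. Weight the benchmarks by how much you trust them and the ordering near the top can shift.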
Look at the HumanEval column. Five models score exactly 91, and four more land between 93 and 95. The benchmark has a ceiling problem: it tests function-level Python generation, and frontier models have gotten too good at it. HumanEval now tells you almost nothing about whether Model A is better than Model B at real coding work.
SWE-bench Verified and LiveCodeBench are where the actual separation happens. SWE-bench tests multi-file bug fixes in real GitHub repos. LiveCodeBench pulls fresh competitive programming problems so models can't memorize them. The spread on these two benchmarks is much wider: 85 vs 75 on SWE-bench between first and sixth place, and 85 vs 71 on LiveCodeBench.
If someone quotes you a HumanEval score in 2026, ask them about SWE-bench instead.
Here's the most interesting pattern in the data. Claude Opus 4.6 scores 82.0 on coding — essentially tied with GPT-5.4 at 82.3. But Opus 4.6 is a non-reasoning model. It doesn't use chain-of-thought at inference time. GPT-5.4 does.
That means Opus is matching a reasoning model's coding output without the extra compute step. For latency-sensitive applications like autocomplete or interactive coding assistants, that matters. Reasoning models think before they respond, and that delay adds up when you're waiting for suggestions every few keystrokes.
The top three spots are all reasoning models. But fourth place (Opus 4.6), fifth (Grok 4.1), and sixth (Gemini 3.1 Pro) are all non-reasoning. The gap between reasoning and non-reasoning isn't as large as you might expect: from a fraction of a point (GPT-5.4 vs Opus 4.6) to roughly six points (GPT-5.3 Codex vs Opus 4.6), depending on which models you compare.
If you need to self-host or fine-tune, GLM-5 (Reasoning) is the current open-weight leader for coding with a 69.3 average (HE: 88, SWE: 62, LCB: 58). Kimi K2.5 (Reasoning) is close behind at 69.0, and Qwen3.5 397B (Reasoning) rounds out the top three at 67.7.
All three are reasoning models. The best non-reasoning open-weight option is DeepSeek Coder 2.0 at 59.3, a full 10 points behind GLM-5.
The gap between open-weight and proprietary is still large. GLM-5 (Reasoning) at 69.3 trails GPT-5.3 Codex at 88.3 by 19 points. That gap narrows on HumanEval (88 vs 95) but is painfully obvious on SWE-bench (62 vs 85) and LiveCodeBench (58 vs 85). Open-weight models still struggle with the real-world, multi-file coding tasks that SWE-bench tests.
Three standardized tests can't capture everything about how a model performs as a coding assistant. A few things that don't show up in these numbers:
Multi-file refactors. SWE-bench gets closest to this, but the patches are still relatively contained. No benchmark tests "rename this abstraction across 40 files and update all the tests." That's a huge chunk of real engineering work.
Framework-specific knowledge. Does the model know the right way to set up middleware in your particular web framework? Can it write idiomatic React, or does it produce code that works but looks like it's from 2019? Benchmarks test algorithmic ability, not framework fluency.
Agent loop quality. When a model is running inside Cursor, Copilot, or Claude Code, it's not just generating code once. It's reading errors, retrying, running tests, editing files. How well a model performs in that loop — recovering from mistakes, knowing when to stop — doesn't show up in any single-pass benchmark.
IDE integration. Tab-completion speed, context window usage, how well the model uses surrounding code as context. Two models with identical benchmark scores can feel completely different in your editor.
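To make the agent-loop point concrete, here's a minimal sketch of the generate-apply-test-retry cycle an agentic coding tool runs. The function names (`generate_patch`, `apply_patch`, `run_tests`) are hypothetical stand-ins for illustration, not the real API of Cursor, Copilot, or Claude Code:

```python
# Hypothetical sketch of an agentic coding loop: generate a patch, run the
# tests, and feed failures back into the next model call. The three callables
# are illustrative stand-ins, not any real tool's API.

def agent_loop(task, generate_patch, apply_patch, run_tests, max_attempts=5):
    feedback = None
    for _ in range(max_attempts):
        patch = generate_patch(task, feedback)  # model call; may use prior errors
        apply_patch(patch)
        passed, errors = run_tests()
        if passed:
            return patch        # knowing when to stop
        feedback = errors       # recovering from mistakes
    return None                 # give up rather than loop forever
```

A single-pass benchmark scores only the first `generate_patch` call. Everything else in this loop (how useful the model finds the error output, whether it converges or thrashes) is exactly the part that doesn't show up in the table above.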
Autocomplete and tab-completion: You want low latency here. A non-reasoning model like Claude Opus 4.6 or Grok 4.1 will respond faster than a reasoning model. The coding quality difference at this level is minimal — you're generating 1-5 lines at a time, and all the top models handle that well.
Debugging and bug fixes: This is SWE-bench territory. GPT-5.3 Codex leads at 85, but it's a specialized model. Among general-purpose models, GPT-5.4 (81) and Claude Opus 4.6 (80) are your best bets. The difference is one point — pick based on price, latency, or which tool you prefer.
Greenfield projects and scaffolding: Context window matters here. Claude Opus 4.6 and GPT-5.4 both offer 1M token windows, letting you feed in large specs or existing codebases. Gemini 3.1 Pro also has 1M context. GPT-5.3 Codex is limited to 400K.
Competitive programming and algorithmic work: LiveCodeBench is the benchmark to watch. GPT-5.3 Codex dominates at 85. GPT-5.2 is second at 79. If you're doing LeetCode-style problems or contest prep, the Codex models have a clear edge.
Large refactors with agentic tools: Benchmark scores matter less here than the tool ecosystem. Claude Code supports parallel sub-agents for large-scale changes. Codex CLI from OpenAI has its own approach. Pick the model that works best with the agent framework you're using.
GPT-5.3 Codex wins on the numbers. It's the best coding model by a 5-point margin over second place. But it's purpose-built for code — you'll want a different model for non-coding tasks.
For a general-purpose model that's also excellent at coding, Claude Opus 4.6 and GPT-5.4 are within a point of each other. The choice between them comes down to whether you value faster response times (Opus, no chain-of-thought overhead) or slightly higher SWE-bench scores (GPT-5.4, one point higher).
And if you're self-hosting, GLM-5 (Reasoning) is your best option, but expect a meaningful quality gap versus the proprietary leaders.
All benchmark data is from our coding leaderboard. Compare models head-to-head on our Claude Opus 4.6 vs GPT-5.4 or GPT-5.3 Codex vs GPT-5.4 comparison pages.