Which AI model is best for coding in 2026? We rank every major LLM by SWE-bench Verified, LiveCodeBench, and SWE-bench Pro scores — with pricing and use-case guidance.
GPT-5.3 Codex leads the 2026 coding leaderboard with a 90 SWE-bench Pro score — and at $2.50/$10 per million tokens, it's one of the few frontier coding models that doesn't require a flagship budget. GPT-5.4 Pro scores slightly better on SWE-bench Verified (86 vs 85) but costs 12x more on input and 18x more on output. For most teams, GPT-5.3 Codex hits the best balance of capability and cost.
This ranking uses SWE-bench Pro, SWE-bench Verified, and LiveCodeBench as the primary signals. HumanEval is saturated — four models score 95% and nearly every frontier model clears 91% — and no longer differentiates between frontier coding models.
| Model | SWE-bench Pro | SWE-bench Verified | LiveCodeBench | HumanEval | Price (input/output) |
|---|---|---|---|---|---|
| GPT-5.3 Codex | 90 | 85 | 85 | 95 | $2.50/$10 |
| GPT-5.4 Pro | 89 | 86 | 86 | 95 | $30/$180 |
| GPT-5.2 Pro | 89 | 83 | 81 | 93 | $25/$150 |
| GPT-5.2-Codex | 86 | 76 | 66 | 95 | $2/$8 |
| GPT-5.4 | 85 | 84 | 84 | 95 | $2.50/$15 |
| GPT-5.2 | 85 | 80 | 79 | 91 | $2/$8 |
| GPT-5.3-Codex-Spark | 85 | 80 | 80 | 91 | $2/$8 |
| Claude Opus 4.6 | 74 | 80 | 75 | 91 | $15/$75 |
| Grok 4.1 | 73 | 77 | 73 | 91 | $3/$15 |
| Gemini 3.1 Pro | 72 | 75 | 71 | 91 | $1.25/$5 |
| Claude Sonnet 4.6 | 64 | 69 | 54 | 93 | $3/$15 |
| DeepSeek Coder 2.0 | 61 | 51 | 45 | 82 | $0.27/$1.10 |
Scores from BenchLM.ai leaderboard. Prices per million tokens.
GPT-5.3 Codex leads on SWE-bench Pro (90) and ties GPT-5.4 Pro on HumanEval (95). What makes this notable is the price: $2.50/$10 per million tokens, compared to $30/$180 for GPT-5.4 Pro.
The gap between GPT-5.3 Codex and GPT-5.4 Pro on SWE-bench Verified is one point (85 vs 86). On LiveCodeBench it's also one point (85 vs 86). For nearly every practical coding task, the performance difference will be imperceptible. The cost difference is not.
For an AI coding assistant generating 10M output tokens per month, GPT-5.3 Codex costs $100/month. GPT-5.4 Pro costs $1,800/month. That math drives model selection at any meaningful scale.
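That math is easy to reproduce. A minimal sketch, using the prices from the table above (the 10M-token monthly volume is the illustrative figure from this article, not a measured workload):

```python
# Prices per million tokens (input, output), from the comparison table.
PRICES = {
    "GPT-5.3 Codex": (2.50, 10.00),
    "GPT-5.4 Pro": (30.00, 180.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month, volumes given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out

# 10M output tokens/month, ignoring input for the comparison in the text:
print(monthly_cost("GPT-5.3 Codex", 0, 10))  # 100.0
print(monthly_cost("GPT-5.4 Pro", 0, 10))    # 1800.0
```

Input tokens usually dominate in practice (prompts are long, completions short), which tilts the math even further toward the cheaper model.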
OpenAI's coding model lineup in 2026 is confusing. Here's how to read it:
- "Codex" suffix = coding-specialized variant. Higher SWE-bench scores, but it may underperform general models on open-ended chat and reasoning.
- "Spark" suffix = lighter, faster variant. GPT-5.3-Codex-Spark ($2/$8) scores 85 on SWE-bench Pro vs GPT-5.3 Codex's 90, but costs 20% less on both input and output.
- "Pro" suffix = highest-capability flagship. GPT-5.4 Pro and GPT-5.2 Pro lead the overall benchmarks but are priced for enterprise budgets.
The practical tiers for coding:
Short completions (under 50 tokens) don't require SWE-bench-level capability. The latency and cost profile matter more than marginal benchmark differences.
Best options: GPT-5.3-Codex-Spark ($2/$8) for quality completions, Gemini 3.1 Pro ($1.25/$5) for cost-sensitive high-volume use. Both score 91%+ on HumanEval.
Repo-level bug fixing is exactly what SWE-bench measures. GPT-5.3 Codex (90) and GPT-5.4 (85) are the clear choices. Claude Opus 4.6 scores 74 on SWE-bench Pro — notable as the best non-OpenAI option, but 16 points behind the leader.
Best option: GPT-5.3 Codex
Agentic coding burns tokens fast. Terminal-Bench 2.0 measures performance in terminal-based coding environments — OpenAI models score 90 across the board, Claude Opus 4.6 scores 80.
The cost factor is critical for agents: Claude Opus 4.6 at $15/$75 adds up quickly in agent loops. GPT-5.3 Codex at $2.50/$10 is the far more sustainable choice for agents making hundreds of calls.
Best option: GPT-5.3 Codex (score 90, $2.50/$10). For Claude users: Claude Sonnet 4.6 ($3/$15) over Claude Opus 4.6 for agents.
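To see why per-token price dominates agent economics, here is a rough cost model. The call count and per-call token volumes are illustrative assumptions, not benchmark data; only the prices come from the table above:

```python
def agent_run_cost(calls: int, in_tok_per_call: int, out_tok_per_call: int,
                   price_in: float, price_out: float) -> float:
    """Dollar cost of one agent run; prices are per million tokens."""
    total_in_mtok = calls * in_tok_per_call / 1_000_000
    total_out_mtok = calls * out_tok_per_call / 1_000_000
    return total_in_mtok * price_in + total_out_mtok * price_out

# Assumed loop: 200 model calls, 8k input / 1k output tokens per call.
codex = agent_run_cost(200, 8_000, 1_000, 2.50, 10.00)   # GPT-5.3 Codex
opus = agent_run_cost(200, 8_000, 1_000, 15.00, 75.00)   # Claude Opus 4.6
print(f"GPT-5.3 Codex: ${codex:.2f}, Claude Opus 4.6: ${opus:.2f}")
```

Under these assumptions a single run costs $6.00 on GPT-5.3 Codex versus $39.00 on Claude Opus 4.6 — a 6.5x gap that compounds across every run, every day.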
LiveCodeBench pulls fresh competitive programming problems continuously — GPT-5.4 Pro leads at 86, GPT-5.3 Codex at 85. DeepSeek Coder 2.0 at 45 is a significant drop-off.
Best option: GPT-5.4 Pro if budget allows, GPT-5.3 Codex for value.
No dedicated SQL benchmark exists at frontier level yet. Based on structured output and reasoning scores, GPT-5.4 and GPT-5.3 Codex both handle complex SQL reliably. For batch data pipelines, Gemini 3.1 Pro ($1.25/$5) is the cost-effective choice.
Test generation is underrepresented in benchmarks. Strong SWE-bench performance correlates with good test generation since fixing bugs often requires writing regression tests. GPT-5.3 Codex and GPT-5.4 are both reliable here.
The open-source coding landscape in 2026 is weaker than the frontier for hard software engineering tasks.
For teams that need fully self-hosted models, the quality ceiling for open-source coding in 2026 sits well below the frontier. The 29-point SWE-bench Pro gap between DeepSeek Coder 2.0 and GPT-5.3 Codex (34 points on Verified) is large enough to matter in production.
Need the best possible coding model: GPT-5.4 Pro (SWE-bench Verified 86, $30/$180) or GPT-5.3 Codex (SWE-bench Pro 90, $2.50/$10). For most tasks the two are interchangeable, so the 12x price difference decides it.
Running an AI coding agent at scale: GPT-5.3 Codex. The agent loop cost math makes $30/$180 unsustainable for most teams.
Claude ecosystem: Claude Opus 4.6 (74 SWE-bench Pro) or Claude Sonnet 4.6 (64 SWE-bench Pro). Both significantly behind GPT-5.3 Codex on coding benchmarks. Worth the tradeoff only if other Claude capabilities matter more for your workflow.
Budget-first coding: GPT-5.3-Codex-Spark ($2/$8, 85 SWE-bench Pro) or GPT-5.2-Codex ($2/$8, 86 SWE-bench Pro). Both outperform Claude Opus 4.6 on SWE-bench Pro at a fraction of the cost.
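The recommendations above reduce to a simple lookup. As a sketch — the use-case keys and the `pick_model` helper are this article's categories encoded as a hypothetical function, not any official API:

```python
# Hypothetical picker encoding this article's recommendations.
RECOMMENDATIONS = {
    "best-possible": "GPT-5.4 Pro",           # if budget allows
    "agent-at-scale": "GPT-5.3 Codex",        # sustainable loop costs
    "claude-ecosystem": "Claude Opus 4.6",    # best non-OpenAI option
    "budget-first": "GPT-5.3-Codex-Spark",    # 85 SWE-bench Pro at $2/$8
}

def pick_model(use_case: str) -> str:
    """Return this article's recommendation, defaulting to the value pick."""
    return RECOMMENDATIONS.get(use_case, "GPT-5.3 Codex")

print(pick_model("agent-at-scale"))  # GPT-5.3 Codex
```

The default fallback reflects the article's overall thesis: when in doubt, GPT-5.3 Codex is the value pick.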
What is the best LLM for coding in 2026? GPT-5.3 Codex — 90 SWE-bench Pro, 85 SWE-bench Verified, 85 LiveCodeBench, at $2.50/$10 per million tokens. Best performance per dollar by a wide margin.
How does Claude compare to GPT for coding? Claude Opus 4.6 scores 74 SWE-bench Pro vs GPT-5.3 Codex's 90 — a 16-point gap. Claude also costs 6x more on input. For pure coding tasks, GPT-5.3 Codex wins on both performance and price.
Is SWE-bench a good benchmark for coding AI? Yes — it's the most reliable coding signal available. It tests real bug-fixing on actual GitHub repositories, not toy functions. HumanEval is saturated (multiple models at 95%) and no longer differentiates frontier models.
What's the best coding model for an AI agent? GPT-5.3 Codex ($2.50/$10). Terminal-Bench 2.0 score of 90, SWE-bench Pro 90, and sustainable cost for high-token agent loops. Claude Opus 4.6 ($15/$75) is too expensive to run at agent scale for most teams.
Should I use GPT-5.4 Pro or GPT-5.3 Codex for coding? GPT-5.3 Codex for almost all use cases. It leads GPT-5.4 Pro on SWE-bench Pro (90 vs 89) and is within 1 point on SWE-bench Verified and LiveCodeBench. GPT-5.4 Pro costs 12x more on input and 18x more on output. The practical performance difference is negligible.
Benchmark scores from BenchLM.ai. Prices per million tokens, current as of March 2026.