Claude Opus 4.6 vs GPT-5.4 head-to-head: current benchmark scores, pricing, and where each model actually wins. GPT-5.4 now leads overall, while Claude stays extremely close and still has real workflow-specific advantages.
GPT-5.4 now leads Claude Opus 4.6 on BenchLM's overall leaderboard, 94 to 92. That is the headline change. The more important point is that this is not a blowout. Claude is still extremely close on coding and agentic work, while GPT-5.4 keeps the cleaner edge on overall score, knowledge, math, and price-adjusted practicality.
If you only look at one or two raw benchmarks, you can still make either model look like the winner. GPT-5.4 wins the broader scoreboard. Claude still has real reasons to choose it, especially if your work is writing-heavy, latency-sensitive, or dependent on interaction quality rather than only the headline score.
| Metric | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|
| Overall score | 94 | 92 |
| Overall rank | #3 | #4 |
| Coding score | 90.7 | 90.8 |
| Agentic score | 93.5 | 92.6 |
| Knowledge score | 97.6 | 92.4 |
| Math score | 94.5 | 89.4 |
| Price (in/out) | $2.50 / $15 | $15 / $75 |
| Context window | 1.05M | 1M |
The category-level picture is clearer than the old 85-vs-82 framing ever was. Claude is still basically tied on coding, still close on agentic work, and still easier to justify when response style matters. GPT-5.4 is the stronger broad default because it combines a slightly higher overall score with much stronger cost efficiency and better knowledge depth.
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gap |
|---|---|---|---|
| HLE | 48 | 53 | +5 Claude |
| GPQA | 92.8 | 91.3 | +1.5 GPT |
| MMLU-Pro | 93 | 82 | +11 GPT |
| SWE-bench Pro | 57.7 | 74 | +16.3 Claude |
| SWE-bench Verified | 84 | 80.8 | +3.2 GPT |
| LiveCodeBench | 84 | 76 | +8 GPT |
| Terminal-Bench 2.0 | 75.1 | 65.4 | +9.7 GPT |
| OSWorld-Verified | 75 | 72.7 | +2.3 GPT |
| BrowseComp | 82.7 | 83.7 | +1 Claude |
| SimpleQA | 97 | 72 | +25 GPT |
| LongBench v2 | 95 | 92 | +3 GPT |
| MRCRv2 | 97 | 92 | +5 GPT |
| IFEval | 96 | 95 | +1 GPT |
| MMMU-Pro | 81.2 | 77.3 | +3.9 GPT |
| OfficeQA-Pro | 96 | 94 | +2 GPT |
The benchmark-level story is mixed. Claude still has the most dramatic single coding win here on SWE-bench Pro, and its HLE lead remains meaningful. GPT-5.4, though, wins more of the widely used broad-purpose rows, especially on knowledge, long-context reasoning, and document-heavy multimodal tasks.
| | GPT-5.4 | Claude Opus 4.6 | Ratio |
|---|---|---|---|
| Input (per million tokens) | $2.50 | $15.00 | Claude is 6x higher |
| Output (per million tokens) | $15.00 | $75.00 | Claude is 5x higher |
At 1M output tokens per month, GPT-5.4 costs $15 and Claude Opus 4.6 costs $75. At 10M output tokens per month, that becomes $150 versus $750. The pricing gap is still the biggest practical reason to choose GPT-5.4.
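If you want to sanity-check the math against your own usage, here is a minimal sketch. It only uses the per-million-token prices quoted in the table above; the `monthly_cost` helper and model keys are illustrative names, not a vendor API.

```python
# Per-million-token prices from the comparison table above (USD).
PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "claude-opus-4.6": {"input": 15.00, "output": 75.00},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend in dollars for a given token volume."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# 10M output tokens per month, ignoring input for simplicity (as in the example above):
print(monthly_cost("gpt-5.4", 0, 10_000_000))           # 150.0
print(monthly_cost("claude-opus-4.6", 0, 10_000_000))   # 750.0
```

Plug in your own input/output split to see how quickly the gap compounds at production volumes.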
Use GPT-5.4 if you want the stronger broad default. It is ahead overall, stronger on knowledge and agentic work, and dramatically cheaper.
Use Claude Opus 4.6 if your workflow is writing-heavy or latency-sensitive, or if you want a near-GPT-level benchmark profile with a more direct interaction style.
This is now a close-call flagship comparison, not the old "Claude clearly leads GPT-5.4" story. The current data says GPT-5.4 is ahead, but only modestly, and the reasons to pick Claude are still real.
→ Full comparison table · Coding leaderboard · Overall rankings
Is Claude Opus 4.6 better than GPT-5.4? Not on the current overall score. GPT-5.4 leads 94 to 92. Claude still has meaningful strengths in writing-heavy and lower-latency workflows.
Where does Claude Opus 4.6 beat GPT-5.4? Claude's clearest benchmark edges are HLE and SWE-bench Pro. It is also effectively tied on coding category score.
How much does Claude Opus 4.6 cost compared to GPT-5.4? Claude Opus 4.6 is 6x more expensive on input and 5x more expensive on output.
Should I use Claude Opus or Claude Sonnet? Claude Sonnet 4.6 is far cheaper and currently scores 86 overall. Claude Opus 4.6 scores 92, so whether Opus is worth it depends on how expensive mistakes are in your workflow.
What's the best model for coding in 2026? On the broader coding leaderboard, several specialist coding models rank above both of these. In this specific head-to-head, Claude and GPT-5.4 are nearly tied on coding category score, with GPT-5.4 still stronger on raw SWE-bench Verified and LiveCodeBench.
All benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.