A direct benchmark comparison of Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4 across 22 tests. We break down where each model leads and where benchmarks stop telling the full story.
GPT-5.4 scores 91 overall on our leaderboard. Claude Opus 4.6 scores 90. A one-point gap across 22 benchmarks. If you stopped reading here, you'd pick GPT-5.4 and move on. But that one-point gap hides a more interesting story about what each model is actually good at.
We ran both models through every benchmark we track — coding, math, knowledge, reasoning, instruction following, and multilingual. Here's what the numbers say, and a few things they don't.
| | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|
| Overall Score | 90 | 91 |
| Context Window | 1M tokens | 1M tokens |
| Reasoning Type | Non-Reasoning | Reasoning |
| Arena Elo | 1422 | 1442 |
Both models ship with 1M token context windows. The Elo gap (20 points) looks bigger than the benchmark gap, which tells you human preference and benchmark scores measure different things.
One structural difference worth flagging: GPT-5.4 is classified as a reasoning model, meaning it uses chain-of-thought at inference time. Claude Opus 4.6 is a standard (non-reasoning) model. That makes Opus's near-matching score more impressive — it's doing this without the extra inference compute.
| Benchmark | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|
| HumanEval | 91 | 91 |
| SWE-bench Verified | 80 | 81 |
| LiveCodeBench | 75 | 75 |
Tied on HumanEval. Tied on LiveCodeBench. GPT-5.4 has a single-point edge on SWE-bench Verified (81 vs 80). In practice, the coding gap between these two models is noise.
Worth noting that GPT-5.3 Codex (a coding-specialized variant) scores 85 on SWE-bench and 85 on LiveCodeBench — significantly higher than either general model. If coding is your primary use case, the Codex variant matters more than the Opus-vs-5.4 debate.
| Benchmark | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|
| AIME 2025 | 98 | 98 |
| HMMT 2025 | 96 | 96 |
| BRUMO 2025 | 96 | 96 |
| MATH-500 | 98 | 99 |
Both models score identically on every competition math benchmark except MATH-500, where GPT-5.4 leads by a single point. At this level of performance — both clearing 96% on HMMT and 98% on AIME — the differences are within margin of error. Competition math is essentially solved by both.
| Benchmark | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|
| MMLU | 99 | 99 |
| GPQA | 97 | 97 |
| SuperGPQA | 95 | 95 |
| OpenBookQA | 93 | 93 |
| MMLU-Pro | 92 | 91 |
| HLE | 38 | 46 |
Identical across the board except on two benchmarks. Claude takes MMLU-Pro by a point (92 vs 91). GPT-5.4 takes HLE by 8 points (46 vs 38). HLE (Humanity's Last Exam) is the hardest knowledge benchmark we track — questions written by domain experts specifically to stump frontier models. GPT-5.4's lead here is the biggest single-benchmark gap in this comparison.
That 8-point HLE gap is real. It suggests GPT-5.4's reasoning capabilities (chain-of-thought) give it an edge on the hardest questions where working through the problem step by step matters most.
| Benchmark | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|
| SimpleQA | 95 | 95 |
| MuSR | 93 | 93 |
| BBH | 94 | 95 |
One point on BBH (BIG-Bench Hard). Otherwise identical.
Here's what these 22 tests don't measure:
Writing quality. Claude has a reputation for producing more natural prose and following nuanced style instructions. GPT-5.4 is better at structured output and following rigid format specifications. Neither strength shows up in any benchmark we track.
Agent capabilities. Claude Opus 4.6 can spawn sub-agents through Claude Code — Anthropic demonstrated 16 parallel agents building a C compiler. GPT-5.4 operates as a single agent. No benchmark captures this difference, but it matters if you're building agentic workflows.
Cost. GPT-5.4 runs around $2.50/$15 per million input/output tokens. Claude Opus 4.6 costs roughly $5/$25. GPT-5.4 is roughly half the price. For production workloads processing millions of tokens, that ~2x cost difference often matters more than a 1-point benchmark gap.
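To make the cost gap concrete, here's a minimal sketch of a per-workload cost calculation using the list prices quoted above (prices are illustrative and change often; check each provider's pricing page):

```python
# Per-million-token list prices quoted in this article (USD).
PRICES = {
    "gpt-5.4":         {"input": 2.50, "output": 15.00},
    "claude-opus-4.6": {"input": 5.00, "output": 25.00},
}

def workload_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Estimated cost in USD for a given token volume."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Example: 100M input tokens and 20M output tokens per month.
gpt_cost = workload_cost("gpt-5.4", 100e6, 20e6)        # 250 + 300 = 550.0
claude_cost = workload_cost("claude-opus-4.6", 100e6, 20e6)  # 500 + 500 = 1000.0
```

At that volume the gap is about $450/month — far more decisive than any one-point benchmark difference.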
Latency. Reasoning models (GPT-5.4) spend extra time on chain-of-thought before responding. For real-time applications — chatbots, autocomplete, live coding assistants — that latency penalty can be a dealbreaker regardless of benchmark scores.
If you're choosing one model for everything: pick whichever is cheaper or faster for your use case. The benchmark gap is too small to decide on performance alone.
If you can route between models: send coding work to GPT-5.3 Codex, the hardest reasoning questions (HLE-style) to GPT-5.4, and prose, agentic, or latency-sensitive work to Claude Opus 4.6.
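A multi-model setup can be as simple as a lookup keyed on task type. This is a minimal sketch based on the strengths discussed above; the task labels and model IDs are illustrative, not real API identifiers:

```python
# Hypothetical task-based router; labels and model names are illustrative.
ROUTES = {
    "coding": "gpt-5.3-codex",        # highest SWE-bench / LiveCodeBench scores
    "hard_reasoning": "gpt-5.4",      # leads HLE by 8 points
    "prose": "claude-opus-4.6",       # stronger natural writing
    "agentic": "claude-opus-4.6",     # sub-agent support via Claude Code
    "realtime": "claude-opus-4.6",    # no chain-of-thought latency penalty
}

def pick_model(task_type: str) -> str:
    # Default to the cheaper general model when the task type is unknown.
    return ROUTES.get(task_type, "gpt-5.4")
```

The useful property of routing is that you stop paying Opus prices (or reasoning-model latency) for tasks where the two models are effectively tied.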
The era where one model dominated every category is over. The top of the leaderboard is now separated by single-digit points, and the real differences between models are in the things benchmarks don't test.
All benchmark data is from our leaderboard. Compare these models directly on our comparison page.