
Claude Opus 4.6 vs GPT-5.4: Where Each Model Actually Wins

A direct benchmark comparison of Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4 across 22 tests. We break down where each model leads and where benchmarks stop telling the full story.

Glevd·March 7, 2026·8 min read

GPT-5.4 scores 91 overall on our leaderboard. Claude Opus 4.6 scores 90. A one-point gap across 22 benchmarks. If you stopped reading here, you'd pick GPT-5.4 and move on. But that one-point gap hides a more interesting story about what each model is actually good at.

We ran both models through every benchmark we track — coding, math, knowledge, reasoning, instruction following, and multilingual. Here's what the numbers say, and a few things they don't.

The headline numbers

| | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- |
| Overall Score | 90 | 91 |
| Context Window | 1M tokens | 1M tokens |
| Reasoning Type | Non-Reasoning | Reasoning |
| Arena Elo | 1422 | 1442 |

Both models ship with 1M token context windows. The Elo gap (20 points) looks bigger than the benchmark gap, which tells you human preference and benchmark scores measure different things.
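To put the 20-point Elo gap in concrete terms, the standard Elo expectancy formula converts a rating difference into an expected win rate. A minimal sketch (the function name is mine):

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred over model B
    under the standard Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# GPT-5.4 (1442) vs Claude Opus 4.6 (1422): a 20-point gap
p = elo_win_prob(1442, 1422)
print(f"{p:.1%}")  # about 52.9% -- barely better than a coin flip
```

In other words, a 20-point Elo edge means human raters prefer GPT-5.4's answer only about 53% of the time in head-to-head comparisons.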

One structural difference worth flagging: GPT-5.4 is classified as a reasoning model, meaning it uses chain-of-thought at inference time. Claude Opus 4.6 is a standard (non-reasoning) model. That makes Opus's near-matching score more impressive — it's doing this without the extra inference compute.

Coding: GPT-5.4 leads, but look at which benchmarks

| Benchmark | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- |
| HumanEval | 91 | 91 |
| SWE-bench Verified | 80 | 81 |
| LiveCodeBench | 75 | 75 |

Tied on HumanEval. Tied on LiveCodeBench. GPT-5.4 has a single-point edge on SWE-bench Verified (81 vs 80). In practice, the coding gap between these two models is noise.

Worth noting that GPT-5.3 Codex (a coding-specialized variant) scores 85 on SWE-bench and 85 on LiveCodeBench — significantly higher than either general model. If coding is your primary use case, the Codex variant matters more than the Opus-vs-5.4 debate.

Math: dead heat

| Benchmark | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- |
| AIME 2025 | 98 | 98 |
| HMMT 2025 | 96 | 96 |
| BRUMO 2025 | 96 | 96 |
| MATH-500 | 98 | 99 |

Both models score identically on every competition math benchmark except MATH-500, where GPT-5.4 edges ahead by a single point (99 vs 98). At this level of performance — both clearing 96% on HMMT and 98% on AIME — the differences are within margin of error. Competition math is essentially solved by both.

Knowledge: Claude has a slight edge on MMLU-Pro

| Benchmark | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- |
| MMLU | 99 | 99 |
| GPQA | 97 | 97 |
| SuperGPQA | 95 | 95 |
| OpenBookQA | 93 | 93 |
| MMLU-Pro | 92 | 91 |
| HLE | 38 | 46 |

Identical across the board except two benchmarks. Claude takes MMLU-Pro by a point (92 vs 91). GPT-5.4 takes HLE by 8 points (46 vs 38). HLE (Humanity's Last Exam) is the hardest knowledge benchmark we track — questions written by domain experts specifically to stump frontier models. GPT-5.4's lead here is the biggest single-benchmark gap in this comparison.

That 8-point HLE gap is real. It suggests GPT-5.4's reasoning capabilities (chain-of-thought) give it an edge on the hardest questions where working through the problem step by step matters most.

Reasoning: small GPT-5.4 advantage

| Benchmark | Claude Opus 4.6 | GPT-5.4 |
| --- | --- | --- |
| SimpleQA | 95 | 95 |
| MuSR | 93 | 93 |
| BBH | 94 | 95 |

One point on BBH (BIG-Bench Hard). Otherwise identical.

Where benchmarks stop being useful

Here's what these 22 tests don't measure:

Writing quality. Claude has a reputation for producing more natural prose and following nuanced style instructions. GPT-5.4 is better at structured output and following rigid format specifications. Neither strength shows up in any benchmark we track.

Agent capabilities. Claude Opus 4.6 can spawn sub-agents through Claude Code — Anthropic demonstrated 16 parallel agents building a C compiler. GPT-5.4 operates as a single agent. No benchmark captures this difference, but it matters if you're building agentic workflows.

Cost. GPT-5.4 runs around $2.50/$15 per million input/output tokens. Claude Opus 4.6 costs roughly $5/$25. That makes GPT-5.4 roughly half the price, depending on your input/output mix. For production workloads processing millions of tokens, a near-2x cost difference often matters more than a 1-point benchmark gap.
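A quick sketch of what that pricing gap means for a real workload, using the per-million-token rates quoted above (the workload numbers are illustrative):

```python
def workload_cost(input_mtok: float, output_mtok: float,
                  in_rate: float, out_rate: float) -> float:
    """Cost in dollars for a workload: token counts in millions,
    rates in $ per million tokens."""
    return input_mtok * in_rate + output_mtok * out_rate

# Illustrative workload: 10M input tokens, 2M output tokens
gpt = workload_cost(10, 2, 2.50, 15)   # GPT-5.4 rates from above
opus = workload_cost(10, 2, 5.00, 25)  # Claude Opus 4.6 rates from above
print(f"GPT-5.4: ${gpt:.0f}  Opus 4.6: ${opus:.0f}")  # $55 vs $100
```

The exact ratio depends on your input/output mix, since the input gap (2x) is wider than the output gap (1.7x), but for typical input-heavy workloads it lands close to 2x.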

Latency. Reasoning models (GPT-5.4) spend extra time on chain-of-thought before responding. For real-time applications — chatbots, autocomplete, live coding assistants — that latency penalty can be a dealbreaker regardless of benchmark scores.

So which one should you use?

If you're choosing one model for everything: pick whichever is cheaper or faster for your use case. The benchmark gap is too small to decide on performance alone.

If you can route between models:

  • GPT-5.4 for tasks that need deep multi-step reasoning, especially on hard knowledge questions (HLE-type problems), and when you want lower cost per token.
  • Claude Opus 4.6 for coding assistance, writing tasks, and agentic workflows where sub-agent coordination matters. Especially useful when you need fast responses without chain-of-thought latency.
  • GPT-5.3 Codex if coding is your primary use case. It outperforms both general models on SWE-bench and LiveCodeBench by a clear margin.
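The routing guidance above can be sketched as a simple lookup table. The task categories and model identifiers here are my own shorthand, not official API names:

```python
# Hypothetical task-to-model router following the guidance above.
# Model identifiers are illustrative, not official API model names.
ROUTES = {
    "deep_reasoning": "gpt-5.4",    # hard knowledge questions, HLE-type
    "coding": "gpt-5.3-codex",      # coding-specialized variant
    "writing": "claude-opus-4.6",   # natural prose, style-following
    "agentic": "claude-opus-4.6",   # sub-agent coordination
    "realtime": "claude-opus-4.6",  # no chain-of-thought latency
}

def route(task: str, default: str = "gpt-5.4") -> str:
    """Pick a model for a task category; unknown tasks fall back to
    the cheaper general-purpose default."""
    return ROUTES.get(task, default)
```

Even a static map like this captures most of the value of routing: the hard part is classifying the incoming task, not choosing the model once you have.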

The era where one model dominated every category is over. The top of the leaderboard is now separated by single-digit points, and the real differences between models are in the things benchmarks don't test.


All benchmark data is from our leaderboard. Compare these models directly on our comparison page.