Three frontier flagships launched in eight days. DeepSeek V4 Pro undercuts GPT-5.5 by ~9x on output price under MIT license. Here's how they compare on benchmarks, cost, and real use.
Three frontier flagships landed inside eight days. Anthropic shipped Claude Opus 4.7 on April 16. OpenAI followed with GPT-5.5 on April 23 at $5 input / $30 output per million tokens — double the GPT-5.4 rate. A day later, DeepSeek released V4 Pro under MIT license at $1.74 / $3.48 per million, roughly 9x cheaper than GPT-5.5 on output. Opus 4.7 held its headline pricing flat at $5 / $25. The bill for using the frontier just forked in two directions at once.
TL;DR
- GPT-5.5 — Premium per-token pricing ($5 / $30, doubled from GPT-5.4) and the strongest long-context recall in this trio at 87.5 on MRCRv2 128–256k.
- Claude Opus 4.7 (Adaptive) — Same $5 / $25 per-token headline as Opus 4.6, but a new tokenizer can inflate effective cost; Adaptive covers reasoning-heavy work with the top Arena Elo in the group (1503).
- DeepSeek V4 Pro (Max) — MIT-licensed open weights at $1.74 / $3.48 per million, yet posts the highest overall score in this trio (85) and leads LiveCodeBench at 93.5.
Jump to the summary table for the numbers at a glance.
Three independent frontier flagships arrived inside a single benchmark cycle — frontier releases have never stacked this tightly before.
Any comparison or leaderboard older than this week is stale. GPT-5.4 numbers no longer reflect OpenAI's current price point. Opus 4.6 rows do not capture the tokenizer shift. DeepSeek V3 is already a generation behind. The rest of this post uses data captured today against the three live flagship rows, with per-token pricing and benchmark scores pulled from BenchLM's current snapshot.
| Category | GPT-5.5 | Claude Opus 4.7 (Adaptive) | DeepSeek V4 Pro (Max) | Winner |
|---|---|---|---|---|
| Overall Score | 82 | 73 | 85 | DeepSeek V4 Pro (Max) |
| Arena Elo | — | 1503.08 | — | Claude Opus 4.7 (Adaptive) |
| SWE-bench Verified | — | 87.6 | 80.6 | Claude Opus 4.7 (Adaptive) |
| LiveCodeBench | — | — | 93.5 | DeepSeek V4 Pro (Max) |
| GPQA Diamond | 93.6 | 94.2 | 90.1 | Claude Opus 4.7 (Adaptive) |
| HLE | 52.2 | 54.7 (with tools) | 37.7 | Claude Opus 4.7 (Adaptive) |
| MRCRv2 (128–256k) | 87.5 | 59.2 | — | GPT-5.5 |
| Reasoning type | Reasoning | Reasoning (Adaptive) | Reasoning (Max effort) | — |
| Price (input / output $/M) | $5 / $30 | $5 / $25 | $1.74 / $3.48 | DeepSeek V4 Pro (Max) |
| Cache-hit input $/M | — (Batch 50%) | up to 90% savings ($0.50 effective) | $0.145 | DeepSeek V4 Pro (Max) |
| Context Window | 1M | 1M | 1M | All tied at 1M |
| Max Output Tokens | 128K | 64K | 384K | DeepSeek V4 Pro (Max) |
| Source Type | Proprietary | Proprietary | Open Weight (MIT) | DeepSeek V4 Pro (Max) |
"Overall Score" is BenchLM's blended score computed from coding, reasoning, math, agentic, knowledge, and multimodal benchmark results — and the gaps between these three are small (12 points top to bottom). The "Winner" column is within-this-trio only: Gemini 3.1 Pro sits outside this comparison and would reshuffle several rows if included. Note that Opus 4.7's non-Adaptive row has no public benchmark data yet — the Adaptive variant is used throughout this post as the thinking-mode apples-to-apples. Below we unpack coding, reasoning, pricing, and open-source tradeoffs in that order.
| Coding benchmark | GPT-5.5 | Claude Opus 4.7 (Adaptive) | DeepSeek V4 Pro (Max) |
|---|---|---|---|
| SWE-bench Verified | — | 87.6 | 80.6 |
| SWE-bench Pro | — | 64.3 | — |
| LiveCodeBench | — | — | 93.5 |
| Terminal-Bench 2 | — | 69.4 | — |
GPT-5.5. OpenAI's newest flagship carries an overall score of 82, but its coding rows are still empty on BenchLM. OpenAI's launch materials describe GPT-5.5 as a reasoning model with a step up in coding over GPT-5.4, and the overall score reflects strong performance on what has been published so far (93.6 on GPQA Diamond, 87.5 on MRCRv2 at 128–256k). The specific coding benchmarks we care about here — SWE-bench Verified, SWE-bench Pro, LiveCodeBench, Terminal-Bench 2 — have no public numbers for GPT-5.5 yet. We are not going to invent a score. As third-party evaluations land, the table above will update.
Claude Opus 4.7 (Adaptive). Opus Adaptive posts the strongest published coding number in this trio: 87.6 on SWE-bench Verified, which measures end-to-end patch generation on real GitHub issues. It backs that up with 64.3 on the harder SWE-bench Pro variant and 69.4 on Terminal-Bench 2, which tests multi-step terminal work. Taken together, that is the most direct evidence of repo-scale engineering capability in this group. Adaptive mode is designed to spend more reasoning tokens when the problem warrants it, which helps on the kind of multi-file changes SWE-bench Verified actually measures — and which is where the gap over DeepSeek V4 Pro (Max) shows up.
DeepSeek V4 Pro (Max). DeepSeek leads the trio on LiveCodeBench at 93.5 — the highest published LiveCodeBench score across these three flagships. It lands at 80.6 on SWE-bench Verified: below Opus Adaptive's 87.6, but a respectable number for an MIT-licensed open-weight model. These are reproducible: anyone with compute can run V4 Pro (Max) and verify the numbers themselves, which is a meaningfully different story than a closed API row. The practical implication is that a self-hostable model now competes on coding benchmarks that, until recently, were dominated by closed frontier APIs.
Picking for a repo-wide refactor across 50 files. The honest pick right now is Opus Adaptive. SWE-bench Verified at 87.6 is the closest public proxy for this workload, and Opus leads it. DeepSeek V4 Pro (Max) is a strong secondary pass on individual files — LiveCodeBench-style snippet generation plays to its 93.5 — and its open weights make it attractive for bulk work where latency and cost matter. GPT-5.5 stays a backup until its coding scores surface on BenchLM.
| Benchmark | GPT-5.5 | Claude Opus 4.7 (Adaptive) | DeepSeek V4 Pro (Max) |
|---|---|---|---|
| GPQA Diamond | 93.6 | 94.2 | 90.1 |
| HLE | 52.2 | 54.7 (with tools) | 37.7 |
| FrontierMath | — | 43.8 | — |
| ARC-AGI-2 | — | 75.8 | — |
GPT-5.5. GPT-5.5 lands at 93.6 on GPQA Diamond, a statistical tie with Opus Adaptive's 94.2 once you account for the noise floor on a benchmark this compressed. On HLE it scores 52.2 — below Opus Adaptive's 54.7 (with tools) but well ahead of DeepSeek V4 Pro (Max) at 37.7. The long-context reasoning numbers from the hero matter here too: 87.5 on MRCRv2 at 128–256k and 83.1 at 64–128k are the highest in the trio, and multi-step reasoning over long inputs is exactly where that recall shows up. Math-specific AIME and MATH-500 rows are still empty on BenchLM for 5.5.
Claude Opus 4.7 (Adaptive). Opus Adaptive tops the trio on both frontier knowledge rows: 94.2 on GPQA Diamond and 54.7 on HLE with tools (46.9 no-tools). The HLE number is the best in this group — no model in this comparison beats it. FrontierMath at 43.8 is a useful data point on a harder, more recent math benchmark where most models still sit in the single digits or low twenties. ARC-AGI-2 at 75.8 backs up abstract-reasoning coverage. Adaptive's design — spend more reasoning tokens when the problem warrants it — pairs well with HLE-style multi-step science questions where the marginal token earns its keep.
DeepSeek V4 Pro (Max) — thinking-mode cost math. DeepSeek V4 Pro (Max) posts 90.1 on GPQA Diamond — within striking distance of the other two — and 37.7 on HLE, a meaningful gap on the harder test. The pricing footnote worth pinning down: Max effort does not change the per-token price. It changes how many tokens the model emits. A hard reasoning question that takes ~2k output tokens in a non-thinking answer can expand to ~8k tokens once the reasoning trace is included. At $3.48 per million output tokens, the effective cost scales from 2,000 × $3.48/1M = $0.007 to 8,000 × $3.48/1M = $0.028 per query. Compare GPT-5.5 at $30 per million output: 2,000 tokens = $0.06 and 8,000 tokens = $0.24, roughly 9x DeepSeek's cost at the same token volume; even a 2,000-token GPT-5.5 answer ($0.06) costs about twice as much as DeepSeek's quadrupled 8,000-token reasoning trace ($0.028). Thinking mode is a volume multiplier, not a price multiplier. The cost advantage holds.
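To make the arithmetic reproducible, here is a minimal sketch using the illustrative 2k/8k token counts above (they are examples, not measured traces):

```python
# Per-query output cost under thinking mode: the price per token is fixed,
# the reasoning trace only multiplies how many tokens you pay for.
OUT_PRICE = {"DeepSeek V4 Pro (Max)": 3.48, "GPT-5.5": 30.00}  # $ per million output tokens

def query_cost(model: str, output_tokens: int) -> float:
    return output_tokens * OUT_PRICE[model] / 1_000_000

for tokens in (2_000, 8_000):  # non-thinking answer vs. answer plus reasoning trace
    for model in OUT_PRICE:
        print(f"{model:>22} @ {tokens:>5} tokens: ${query_cost(model, tokens):.3f}")
```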
Opus 4.7 Adaptive vs base. Anthropic structures Adaptive as an effort-controlled variant of the same underlying Opus 4.7 model. Per-token pricing is identical — $5 input / $25 output per million. What changes between Adaptive and the non-Adaptive row is how many reasoning tokens you pay for, same price-per-token math as the DeepSeek case above. If your workload doesn't need reasoning, the non-Adaptive Opus 4.7 row is your cost floor. That row's public benchmark coverage is still populating, which is why this post uses Adaptive throughout — it's the apples-to-apples thinking-mode comparison against GPT-5.5 and DeepSeek V4 Pro (Max).
OpenAI priced GPT-5.5 at $5 input / $30 output per million tokens (sourced from OpenAI's pricing page). GPT-5.4 was $2.50 / $15 — so the new flagship is exactly 2x on both sides of the ledger. OpenAI has framed the step-up as pricing for "a new class of intelligence" rather than a simple version bump, and the sticker reflects that positioning. Batch and Flex endpoints halve that rate ($2.50 / $15 at batch), matching the old GPT-5.4 standard price. Priority pricing sits at 2.5x the standard rate — $12.50 input / $75 output — for low-latency tiers.
Per-token, Opus 4.7 is unchanged from Opus 4.6: $5 input / $25 output per million. The catch is what "per token" means. Opus 4.7 ships with a new tokenizer that produces up to 35% more tokens on equivalent input text. A prompt that billed as 5,000 tokens on 4.6 can bill as ~6,750 tokens on 4.7 — same content, same response, different meter. Simon Willison's token counter analysis and Finout's pricing deep-dive have both confirmed this effect across typical workloads.
Worked example. A 5,000-token Opus 4.6 prompt cost $0.025 to feed in ($5 × 5000 / 1,000,000). The same prompt under Opus 4.7's tokenizer runs ~6,750 tokens, costing $0.034 — a 35% effective increase even though the sticker rate is identical. Prompt caching can claw back up to 90% on hits, so the effective input price on repeated prompts lands near $0.50 per million.
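The same worked example as a sketch, assuming the upper-end 35% inflation figure:

```python
# Effective Opus 4.7 input cost under the new tokenizer, same sticker rate as 4.6.
INPUT_PRICE = 5.00 / 1_000_000      # $ per input token
CACHE_HIT_DISCOUNT = 0.90           # up to 90% off input on cache hits

tokens_on_46 = 5_000                     # what the prompt billed as on Opus 4.6
tokens_on_47 = int(tokens_on_46 * 1.35)  # ~6,750 tokens for the same text on 4.7

print(f"Opus 4.6 input: ${tokens_on_46 * INPUT_PRICE:.3f}")     # $0.025
print(f"Opus 4.7 input: ${tokens_on_47 * INPUT_PRICE:.3f}")     # ~$0.034
print(f"Opus 4.7 cache hit: ${tokens_on_47 * INPUT_PRICE * (1 - CACHE_HIT_DISCOUNT):.4f}")
```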
DeepSeek priced V4 Pro at $1.74 input (cache miss) / $3.48 output per million tokens, regardless of thinking mode. Cache hits drop input to $0.145 per million — about 92% savings on repeated context. Thinking mode does not change the sticker price; it changes token volume, which we covered in the reasoning section.
For a chatbot with a 2,000-token fixed system prompt reused across 10,000 requests per day, the cache-hit rate effectively moves the system-prompt cost from about $35 per day to $2.90 per day (roughly $1,040 a month down to $87). That pattern — reusable context at volume — is where V4 Pro's pricing advantage compounds, and where closed APIs with weaker caching structurally lose ground.
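A quick sketch of that system-prompt math, using the 2,000-token prompt and 10,000 requests per day from the example:

```python
# Daily system-prompt cost on DeepSeek V4 Pro: every request on a cache miss vs. a hit.
SYSTEM_PROMPT_TOKENS = 2_000
REQUESTS_PER_DAY = 10_000
MISS_PRICE = 1.74 / 1_000_000    # $ per input token, cache miss
HIT_PRICE = 0.145 / 1_000_000    # $ per input token, cache hit

daily_tokens = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY   # 20M prompt tokens per day
print(f"All cache misses: ${daily_tokens * MISS_PRICE:.2f} per day")   # $34.80
print(f"All cache hits:   ${daily_tokens * HIT_PRICE:.2f} per day")    # $2.90
```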
- OpenAI: Batch = 50% off standard, Flex = 50% off, Priority = 2.5x standard. Batch is asynchronous; Priority targets latency-critical workflows.
- Anthropic: Batch API at 50% off standard, prompt caching up to 90% off input on hits, message-level batching in the API.
- DeepSeek: No separate batch tier — the cache-hit discount functions as the main cost-reduction lever.
The table below computes cost = input_tokens × in_price + output_tokens × out_price for three realistic monthly volumes. Opus 4.7's "effective input" column models the 35% tokenizer inflation on the same underlying content vs Opus 4.6.
| Monthly volume (input / output) | GPT-5.5 | Claude Opus 4.7* | DeepSeek V4 Pro (Max) |
|---|---|---|---|
| 1M / 200K | $11.00 | ~$11.75 | $2.44 |
| 10M / 2M | $110.00 | ~$117.50 | $24.36 |
| 100M / 20M | $1,100.00 | ~$1,175.00 | $243.60 |
*Opus column assumes 35% tokenizer inflation vs Opus 4.6-equivalent content (i.e., input tokens billed are 1.35x the effective-content count at the same $5/M rate). Cache-hit discounts are not applied in this column.
At 100M input / 20M output per month, DeepSeek V4 Pro (Max) costs ~$244 — GPT-5.5 costs ~$1,100 for the same workload, ~4.5x more. Opus 4.7 lands at ~$1,175 after the tokenizer adjustment, slightly above GPT-5.5. Cache-hit rates flip these numbers hard: if 80% of your input hits DeepSeek's cache, monthly input cost drops from $174 to ~$46, and Opus's prompt cache can similarly claw back up to 90% on repeated prompts. For hobby-scale usage (1M/200K), the dollar gaps are trivial — a coffee's difference. At 100M monthly input, the gap is roughly $860 a month, and it scales linearly from there.
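The table is reproducible from the sticker prices alone; here is a minimal sketch, with the footnote's 1.35 multiplier modeling the Opus tokenizer inflation:

```python
# Reproduce the monthly TCO table: cost = input_tokens * in_price + output_tokens * out_price.
# Prices in $ per million tokens; the 1.35 multiplier models Opus 4.7's tokenizer inflation.
MODELS = {
    "GPT-5.5":               {"in": 5.00, "out": 30.00, "input_mult": 1.00},
    "Claude Opus 4.7":       {"in": 5.00, "out": 25.00, "input_mult": 1.35},
    "DeepSeek V4 Pro (Max)": {"in": 1.74, "out": 3.48,  "input_mult": 1.00},
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    p = MODELS[model]
    return input_m * p["input_mult"] * p["in"] + output_m * p["out"]

for input_m, output_m in ((1, 0.2), (10, 2), (100, 20)):   # millions of tokens per month
    cells = "  ".join(f"{m}: ${monthly_cost(m, input_m, output_m):,.2f}" for m in MODELS)
    print(f"{input_m}M in / {output_m}M out -> {cells}")
```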
Price is half the DeepSeek story. The other half is that V4 Pro ships under a real open-source license with weights you can download, modify, and run. GPT-5.5 and Opus 4.7 cannot match that at any price.
DeepSeek V4 Pro is MIT-licensed on Hugging Face. MIT is one of the least restrictive licenses in use: commercial deployment, derivatives, resale, and redistribution are all permitted without a separate grant. There is no acceptable-use rider, no field-of-use carveout, and no user-count trigger. Attribution is the only real obligation.
MIT is a harder guarantee than "open-weight" models under custom licenses. Meta's Llama license auto-revokes if the products you ship with it exceed 700 million monthly active users — the clause counts total product MAUs, not AI feature usage, so a large platform can trip it without anyone touching the model. Llama also carries an acceptable-use policy and a ban on training competing models. Closed APIs bring a different cost: data-handling terms, rate limits, and deprecation risk. If OpenAI sunsets GPT-5.5 in 18 months, your pipeline breaks on their timetable. If DeepSeek disappears tomorrow, you still have the weights.
V4 Pro is a 1.6T-param MoE with 49B active per token. Total memory — not active params — sets the serving floor, because the router can reach any expert on any token. Public deployment guides put the weights at roughly 862GB in mixed FP4/FP8 precision, which is more than the 640GB of HBM on an 8x H100 80GB node; the practical floor is therefore an 8x H200 node (roughly 1.1TB of HBM) running under vLLM with expert parallelism. Aggressive quantization can shrink the footprint enough to squeeze onto 8x H100s, at the usual quality cost.
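As a sanity check on that floor, a minimal sketch comparing the cited checkpoint size against per-node HBM totals (KV cache and activations need headroom on top of the weights):

```python
# Rough serving-floor check for DeepSeek V4 Pro (Max) self-hosting.
# 862 GB is the mixed FP4/FP8 checkpoint size cited above; HBM totals are
# per-node sums for two common 8-GPU configurations.
weights_gb = 862

nodes_hbm_gb = {
    "8x H100 80GB": 8 * 80,    # 640 GB total HBM
    "8x H200 141GB": 8 * 141,  # 1,128 GB total HBM
}

for name, hbm in nodes_hbm_gb.items():
    headroom = hbm - weights_gb
    verdict = "fits, with headroom for KV cache" if headroom > 0 else "too small for the weights alone"
    print(f"{name}: {hbm} GB HBM, {headroom:+} GB after weights ({verdict})")
```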
Break-even against the API is a throughput question. An 8x H100 node runs roughly $16–$32 per hour on-demand. At DeepSeek's $1.74 / $3.48 per million rates, sustained workloads in the ballpark of 100M+ output tokens per day are where self-hosting starts beating the API on pure dollars — below that, the API wins on cost and operational overhead. Our self-host calculator (shipping soon) will take your actual mix and cache-hit rate and return the crossover point; the numbers above are approximations, not a quote.
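Until the calculator ships, a directional sketch of the crossover (output tokens only, on-demand node pricing; input spend, cache hits, and utilization all move the line):

```python
# Directional break-even: an 8x H100 node on-demand vs. paying DeepSeek's API per output token.
NODE_COST_PER_DAY = (24 * 16, 24 * 32)   # $16-$32/hr -> $384-$768 per day
API_OUTPUT_PRICE = 3.48 / 1_000_000      # $ per output token

for daily_output in (25e6, 50e6, 100e6, 250e6):
    api_cost = daily_output * API_OUTPUT_PRICE
    print(f"{daily_output/1e6:>4.0f}M output tokens/day: API ${api_cost:,.0f} "
          f"vs node ${NODE_COST_PER_DAY[0]}-${NODE_COST_PER_DAY[1]}")
# Around 100M output tokens/day the API bill (~$348) crosses the cheap end of the node
# range; counting input-token spend pulls the crossover earlier, while poor GPU
# utilization pushes it later.
```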
OpenAI and Anthropic offer managed fine-tuning on a narrow set of smaller or older models — not full-weight access to GPT-5.5 or Opus 4.7. You cannot pull those weights, run full-parameter fine-tuning, or distill them into a model you own. DeepSeek V4 Pro's weights are downloadable, which puts full fine-tuning, LoRA, QLoRA, and distillation into student models all on the table. For teams with proprietary data in a domain the base model covers weakly — legal corpora, internal codebases, clinical notes — that is a real moat the closed flagships cannot offer.
Self-hosting removes the third-party data processor entirely. No cross-border data transfer, no vendor inside your GDPR, HIPAA, SOC 2, or FedRAMP boundary, no sub-processor list to audit. Regulated buyers — healthcare, legal, finance, defense — often cannot use GPT-5.5 or Opus 4.7 without a BAA or enterprise contract, and even then the data leaves the perimeter. An MIT-licensed model on hardware you control sidesteps the whole conversation. The compliance story is not "save money." It's "ship at all."
All three of these models can run fast or slow depending on which tier you call. The tradeoff is the same shape across vendors: more reasoning means later first token and slower steady-state throughput, in exchange for better answers on hard problems.
GPT-5.5 exposes a base tier and a reasoning tier as separate API options. Base GPT-5.5 streams a first token quickly and runs at full output throughput — good for chat, autocomplete, and short-turn tool use. The reasoning tier adds an internal thinking phase before any visible output, which pushes first-token latency from subsecond into multi-second territory and burns reasoning tokens you pay for.
Claude Opus 4.7 Adaptive picks reasoning depth dynamically per request. On a light prompt it behaves like a non-reasoning model — first token feels close to interactive. On a hard prompt it spends thinking budget before responding, landing between base GPT-5.5 and an explicit thinking mode on first-token latency. The "Adaptive" piece is doing real work here: you do not have to pre-commit a tier per call.
DeepSeek V4 Pro Max is a thinking-mode product. It always runs a visible reasoning pass before the final answer, which means first-token latency is in seconds — not subseconds — and steady-state tokens-per-second is below the non-reasoning tiers of the other two. You trade speed for the LiveCodeBench and GPQA Diamond numbers in section 4.
Artificial Analysis tracks first-token latency and throughput across these models, but the April 2026 numbers for V4 Pro Max and the GPT-5.5 reasoning tier are still settling — treat any single snapshot as directional.
The split that matters in product work is chat vs job. Chat models — Opus 4.7 Adaptive on light prompts, GPT-5.5 base — feel subsecond and let a human iterate. Job models — DeepSeek V4 Pro Max thinking, GPT-5.5 reasoning, Opus 4.7 on hard problems — are throw-it-and-wait. The right latency tier depends on whether a human is waiting.
| Use case | GPT-5.5 | Claude Opus 4.7 Adaptive | DeepSeek V4 Pro |
|---|---|---|---|
| Budget-sensitive production |  |  | ✓ |
| Regulated data / self-host required |  |  | ✓ |
| Long-form writing / editing |  | ✓ |  |
| Complex reasoning / research | ✓ |  | ✓ (thinking) |
| Repo-scale coding agent |  | ✓ |  |
| Highest raw knowledge recall | ✓ |  |  |
| Fine-tuning for a domain |  |  | ✓ |
| Latency-sensitive chat UX | ✓ | ✓ |  |
The matrix follows the benchmark picture. Opus 4.7 Adaptive's 87.6 on SWE-bench Verified is the strongest agentic-coding number in this trio, which is why it owns repo-scale coding agents and writing-heavy work where its non-reasoning behavior on light prompts keeps the loop tight. DeepSeek V4 Pro Max's 93.5 on LiveCodeBench and 90.1 on GPQA Diamond put it level with the closed flagships on hard reasoning when thinking is on, while its price and MIT license make it the only sensible default for budget-sensitive production, regulated workloads, and any team that wants to fine-tune on proprietary data. GPT-5.5 earns its slot on raw knowledge recall: the best no-tools HLE score in the trio and the strongest long-context retrieval.
The more useful question in April 2026 is not "which one" — it's how to mix two. A defensible default: route 80% of traffic to DeepSeek V4 Pro (coding, data transforms, routine Q&A, anything where the price gap compounds) and send writing-heavy or long-form-edit work to Opus 4.7 Adaptive. Keep GPT-5.5 in the loop for knowledge-recall queries and HLE-style hard reasoning where it still leads. The TCO table in the pricing section shows why "pick one flagship" is the wrong frame at this point: the cost delta between DeepSeek V4 Pro and the closed flagships at 100M monthly input is roughly 4.5x, large enough that even a crude router pays for itself on day one.
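As a sketch, a router for that default mix might look like the following; the category labels, model identifier strings, and assignments are illustrative assumptions, not a prescribed taxonomy or real API model names:

```python
# Crude request router for the default mix described above.
# Category labels and model id strings are placeholders for illustration.
ROUTES = {
    "coding": "deepseek-v4-pro",
    "data_transform": "deepseek-v4-pro",
    "routine_qa": "deepseek-v4-pro",
    "writing": "claude-opus-4.7-adaptive",
    "long_form_edit": "claude-opus-4.7-adaptive",
    "knowledge_recall": "gpt-5.5",
    "hard_reasoning": "gpt-5.5",
}

def route(category: str) -> str:
    # Unknown categories fall through to the cheapest model, where the price gap compounds.
    return ROUTES.get(category, "deepseek-v4-pro")

assert route("coding") == "deepseek-v4-pro"
assert route("long_form_edit") == "claude-opus-4.7-adaptive"
assert route("anything_else") == "deepseek-v4-pro"
```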
Which is the best AI model in April 2026? It depends on the workload, and there is no single winner across this trio. DeepSeek V4 Pro Max tops BenchLM's overall blended score at 85 and leads LiveCodeBench at 93.5 in thinking mode. GPT-5.5 leads on long-context retrieval (MRCRv2 87.5 at 128–256k) and on raw knowledge recall without tools (HLE 52.2 vs Opus Adaptive's 46.9 no-tools score). Claude Opus 4.7 Adaptive leads on agentic coding (SWE-bench Verified 87.6), Arena Elo (1503), and writing quality. The 12-point spread between top and bottom on overall score is small — pick by use case rather than a universal ranking.
Is DeepSeek V4 Pro really comparable to GPT-5.5 and Claude Opus 4.7? Yes on most benchmarks. V4 Pro Max posts 93.5 on LiveCodeBench, 90.1 on GPQA Diamond, and 80.6 on SWE-bench Verified — competitive with both closed flagships on the rows that exist, and GPT-5.5 has not yet published coding scores at all. It trails meaningfully on HLE (37.7 vs GPT-5.5's 52.2), so knowledge-recall queries are the one place it loses cleanly. Caveat: those scores require thinking mode, which adds first-token latency measured in seconds rather than subseconds. The asymmetric advantage closed flagships cannot match is open weights under MIT license.
How much cheaper is DeepSeek V4 Pro than GPT-5.5? About 2.9x cheaper on input ($1.74 vs $5.00 per million tokens) and 8.6x cheaper on output ($3.48 vs $30.00 per million). At a typical 5:1 input:output workload the total bill runs roughly 4.5x lower — see the TCO table for exact dollar figures at 1M, 10M, and 100M monthly volumes. Cache hits drop V4 Pro's input price to $0.145 per million, about a 92% saving on repeated prompts. Self-hosting changes the math entirely once sustained throughput crosses roughly 100M output tokens per day — see the open-source section.
Is Claude Opus 4.7 actually more expensive than Opus 4.6? The per-token rate is unchanged at $5 input / $25 output for the Adaptive tier. The new tokenizer produces roughly 35% more tokens for the same prompt, so effective bills run about 35% higher on identical content. A 5,000-token prompt under the old tokenizer becomes around 6,750 tokens under the new one, billed at the same rate. Anthropic did not change the price card; they changed the meter. Cache-hit discounts of up to 90% still apply on repeated prompts, so production workloads with high cache hit rates eat less of the inflation than first-call workloads do.
Can I self-host GPT-5.5 or Claude Opus 4.7? No. OpenAI and Anthropic do not release model weights. Both run managed fine-tuning programs on a narrow set of older or smaller models, but base weights, tokenizer, and inference stack all stay on their infrastructure. DeepSeek V4 Pro's MIT-licensed weights are downloadable from Hugging Face — the only one of the three you can run on your own hardware, fine-tune end-to-end, distill into a smaller student model, or air-gap behind a regulated perimeter. For healthcare, legal, finance, and defense buyers who cannot send data to a third-party processor, that is the deciding line.
The numbers in this post move every release cycle. Compare any two of these models head-to-head, filter by benchmark, or sort by price-per-point on the BenchLM leaderboard and comparison explorer — the rows update as new flagships ship.