Four frontier LLMs now advertise 1M+ token windows. DeepSeek V4 Pro's 384K output ceiling changes generation workflows. Gemini holds the strongest effective-context track record. Here's the real comparison.
All four frontier flagships now advertise 1M+ token context windows. The headline number is solved. What still differs sharply is how much of that window the model can actually use in practice (effective context), how much it can write out in a single call (output ceiling), and what it costs to run at length. This post compares Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7 Adaptive, and DeepSeek V4 Pro across all three axes — using the live April 2026 numbers, not last quarter's.
TL;DR
- Gemini 3.1 Pro — Strongest effective-context track record at 500K-1M; tiered pricing penalizes >200K usage.
- GPT-5.5 — Strongest published MRCRv2 score in this group; 128K output is fine for most generation but not all.
- Claude Opus 4.7 Adaptive — Mid-context interactive sweet spot; 90% prompt caching is the cost lever.
- DeepSeek V4 Pro — The only one with a 384K output ceiling; cheapest input/output by a wide margin; cache hit drops input to $0.145/M.
Jump to the comparison table to see the numbers, or read on for what the advertised window actually buys you in April 2026.
| Model | Window | Max output | LongBench v2 | MRCRv2 (128-256K) | NIAH @1M | Input $/M | Cached $/M |
|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 1,000,000 | 64,000 | — | — | — | $2.00 (≤200K) / $4.00 (>200K) | — |
| GPT-5.5 | 1,000,000 | 128,000 | — | 87.5 | — | $5.00 | ~$2.50 (Batch) / prompt caching |
| Claude Opus 4.7 Adaptive | 1,000,000 | 64,000 | — | — | — | $5.00 | $0.50 (90% prompt cache) |
| DeepSeek V4 Pro | 1,000,000 | 384,000 | — | — | — | $1.74 | $0.145 (cache hit) |
A 1M-token window is now table stakes across the frontier. All four flagships advertise effectively the same input ceiling, so the headline number stops being a differentiator. The interesting differences live in the columns to the right.
The output ceiling is where the four diverge. DeepSeek V4 Pro's 384K max output is 3-6x larger than its peers — Gemini and Opus cap at 64K, GPT-5.5 at 128K. For workflows that generate long artifacts in a single call (full report drafts, large code translations, batch document rewrites), 384K is a structural advantage that no amount of prompt-side context fixes.
Pricing tiers and caching mechanics are the other lever. Gemini 3.1 Pro charges $2.00/M up to 200K input tokens and $4.00/M above — the only model in this group with a window-position-dependent input price. GPT-5.5 lists $5.00/M with the standard ~50% off via Batch and prompt caching available. Claude Opus 4.7 Adaptive lists $5.00/M with up to 90% off on prompt-cache hits — the steepest cache discount in the group, landing effective input near $0.50/M. DeepSeek V4 Pro lists $1.74/M with cache hits at $0.145/M — about 92% off, and the cheapest advertised cache rate in the frontier today.
A note on missing cells: BenchLM's coverage of LongBench v2, MRCRv2, and NIAH @1M is incomplete in April 2026, and we are not going to fabricate scores. The only published row in this set is GPT-5.5 at 87.5 on MRCRv2 (128–256K). Section 3 (advertised vs effective) and section 4 (output ceiling) below pull from primary-source citations where benchmark coverage exists, and flag where it does not.
The advertised window is the size of the box. The effective window is how much of that box the model can actually reason over without losing precision. They are not the same number, and the gap is larger than vendor pages imply. A model that ships with a 1M-token window but degrades past 200K is still a "1M context model" on the spec sheet — the marketing number does not move when the recall curve falls off a cliff.
The public eval surface has settled around four tests. NIAH (needle-in-a-haystack) inserts a single fact deep in a long document and asks the model to retrieve it — a pure recall probe. LongBench v2 is a multi-document QA and reasoning suite that mixes retrieval with synthesis across long inputs. MRCRv2 (multi-round coreference resolution) asks the model to track entities across many turns of a long conversation. RULER is a synthetic battery of long-context probes that stresses different failure modes. These are the public surfaces; vendor-internal evals exist but don't compare across labs.
What we have for the four April-2026 flagships in primary sources is thin. GPT-5.5's MRCRv2 at 128–256K is published at 87.5 in BenchLM's research snapshot, with its 64–128K row at 83.1 — strong numbers, among the highest reported for any frontier model on coreference at that range. Beyond that cell, LongBench v2, NIAH @1M, and RULER scores for the April-2026 checkpoints are sparse. Most public scores you can find today are still on predecessor models — Gemini 2.5 Pro, GPT-5, Opus 4.6, DeepSeek V3.2 — not the 3.1 Pro / 5.5 / 4.7 / V4 Pro line. Vendor self-reports for the new models often claim "near-100% NIAH @1M," but reproducible third-party numbers for these specific checkpoints have not all dropped yet. We are not going to invent precision percentages to fill that gap.
Historical lead patterns are more useful than fake numbers. Gemini has historically led NIAH and effective-context evals through the 2.5 Pro generation; whether 3.1 Pro extends that lead against GPT-5.5 is an open question waiting on public eval drops. GPT-5.5's 87.5 on MRCRv2 (128–256K) is the strongest published coreference number in this group. Anthropic has taken a different design path entirely: prompt caching that makes cached tokens up to 90% cheaper, biasing workflow design (load once, query many times) more than it moves the effective-context score. DeepSeek V4 Pro's launch paper claims solid effective-context performance, but it is the newest model here with the least third-party verification — track Artificial Analysis for numbers as they land.
Why the gap exists at all: attention dilutes as sequence length grows, positional encodings drift past the training distribution, and the "lost in the middle" effect means tokens at the start and end of context get more attention than tokens in the middle. Stack those three and you get the curve where a model nominally indexes 1M tokens but reasons reliably over far less.
Practical rule of thumb: if you are filling more than 80% of an advertised window, run a smoke test on your data before committing. Pick five queries where the answer lives at depths from 10% to 90% of context and check that the model finds them. These evals approximate real workloads; they do not replace them.
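A minimal sketch of that smoke test. It assumes a generic `call_model(prompt)` wrapper around whatever API you use and a list of filler chunks drawn from your own documents; both names are placeholders, not any vendor's SDK:

```python
# Needle-in-a-haystack smoke test: plant one known fact at several depths of a
# long filler context and check that the model retrieves it verbatim.

NEEDLE = "The access code for the archive is 7492-ALPHA."
QUESTION = "What is the access code for the archive? Answer with the code only."
DEPTHS = [0.10, 0.30, 0.50, 0.70, 0.90]  # fraction of the context placed before the needle


def build_haystack(filler_chunks: list[str], needle: str, depth: float) -> str:
    """Insert the needle after `depth` of the filler chunks."""
    split = int(len(filler_chunks) * depth)
    return "\n".join(filler_chunks[:split] + [needle] + filler_chunks[split:])


def run_smoke_test(filler_chunks: list[str], call_model) -> dict[float, bool]:
    """call_model(prompt: str) -> str is your own wrapper around the target API."""
    results = {}
    for depth in DEPTHS:
        context = build_haystack(filler_chunks, NEEDLE, depth)
        answer = call_model(f"{context}\n\n{QUESTION}")
        results[depth] = "7492-ALPHA" in answer
    return results
```

Swap the synthetic needle for queries whose answers live in your real documents; the point is to probe the depths you will actually use, not to reproduce a published benchmark.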
Output capacity is the constraint most teams discover on day two — it's the next section.
Most context-window discussion focuses on the input side — how much you can feed in. That framing breaks the moment you need to generate a long artifact in a single call. Output ceilings vary by 6x across this group, and the differences shape what kind of work is possible at all.
| Model | Max output (single call) |
|---|---|
| Gemini 3.1 Pro | 64,000 tokens |
| GPT-5.5 | 128,000 tokens |
| Claude Opus 4.7 Adaptive | 64,000 tokens |
| DeepSeek V4 Pro | 384,000 tokens |
DeepSeek V4 Pro's 384K output ceiling is 3x GPT-5.5's and 6x both Gemini 3.1 Pro and Opus 4.7 Adaptive. That gap turns several workloads from "stitch multiple calls together with handoff state" into "one shot." Legal drafting: book-length contracts, full briefs, and consolidated case files fit in a single response. Long-form research synthesis — 50 to 100 page reports drawn from a large corpus — fits cleanly. Translation pipelines for novel-length texts (a typical English novel runs ~80K–150K tokens) move from chunked workflow to single-pass. Codebase rewrites where the entire output is the deliverable stop needing a stitching layer. Without 384K of output, every one of these becomes a multi-call orchestration problem with continuity state between chunks.
GPT-5.5's 128K is fine for most real work. A typical long-form report, a detailed technical doc, a fully-fleshed PR description — all fit comfortably with margin. The cases that don't fit are the long-tail: legal-grade drafts, novel translations, large multi-section reports.
Opus 4.7 Adaptive and Gemini 3.1 Pro at 64K cover a narrower set. Drafting a 30K-word article fits with room to spare. A 50K-word technical report does not fit in one call — you'll need to plan a multi-turn structure or accept stitching overhead. For the interactive workflows both models target, 64K is rarely the binding constraint; for single-shot generation it sets a real ceiling.
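What that multi-turn structure looks like in its simplest form, as a sketch: `call_model` is a placeholder for your own API wrapper, and the rolling tail of the draft plus the full outline stand in for continuity state (real pipelines usually also carry cross-reference registries and style guides):

```python
# Minimal stitched-generation loop: carry the outline plus the tail of the draft
# forward so each call can continue where the previous one stopped.

def generate_long_report(call_model, outline: list[str], tail_chars: int = 8_000) -> str:
    """call_model(prompt) -> str is your own wrapper; outline holds one entry per section."""
    draft = ""
    for section in outline:
        prompt = (
            f"Outline:\n{chr(10).join(outline)}\n\n"
            f"Already written (closing excerpt):\n{draft[-tail_chars:]}\n\n"
            f"Continue the report. Write only this section: {section}"
        )
        draft += "\n\n" + call_model(prompt)
    return draft
```

Every pass through the loop pays to re-send the outline and the tail, and any fact stated in an early chunk can drift by a late one; that is the overhead a large single-call output ceiling avoids.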
Worked example. Generating a 50,000-word technical report runs roughly 65K–70K output tokens (English averages ~1.3 tokens per word). Single-call: only DeepSeek V4 Pro and GPT-5.5 can attempt it; Opus and Gemini hit their ceiling first. Multi-call: every model can do it, but you pay context-rebuilding overhead between chunks and risk continuity drift in tone, structure, and cross-references. For output-heavy single-shot work, DeepSeek V4 Pro is in a different tier.
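The same arithmetic as a sketch, using the ~1.3 tokens-per-word estimate and the output ceilings from the table above (the multiplier is a rough English-prose average; run a real tokenizer for anything load-bearing):

```python
# Estimate output tokens for a draft and check which flagships can emit it in one call.

TOKENS_PER_WORD = 1.3  # rough English-prose average, not a tokenizer call

MAX_OUTPUT = {  # single-call output ceilings, April 2026
    "Gemini 3.1 Pro": 64_000,
    "GPT-5.5": 128_000,
    "Claude Opus 4.7 Adaptive": 64_000,
    "DeepSeek V4 Pro": 384_000,
}


def single_call_fit(word_count: int) -> dict[str, bool]:
    est_tokens = int(word_count * TOKENS_PER_WORD)
    return {model: est_tokens <= ceiling for model, ceiling in MAX_OUTPUT.items()}


print(single_call_fit(50_000))
# ~65K estimated output tokens: GPT-5.5 and DeepSeek V4 Pro fit;
# Gemini 3.1 Pro and Opus 4.7 Adaptive need the multi-call plan sketched above.
```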
The output ceiling tells you what's possible in one call. The next section covers what it costs to actually run at length.
Input cost scales linearly with context size, but the per-token rate isn't constant across all four models. Gemini's tier inflection at 200K, Opus's tokenizer inflation, and DeepSeek's cache discount each warp the curve in a different direction. The table below shows raw single-call input cost for the four flagships at 50K / 200K / 500K / 1M tokens — no caching, no batch discount, no output cost.
| Input size | Gemini 3.1 Pro | GPT-5.5 | Opus 4.7 Adaptive | DeepSeek V4 Pro |
|---|---|---|---|---|
| 50K tokens | $0.10 | $0.25 | $0.34* | $0.087 |
| 200K tokens | $0.40 | $1.00 | $1.35* | $0.348 |
| 500K tokens | $2.00 | $2.50 | $3.38* | $0.870 |
| 1M tokens | $4.00 | $5.00 | $6.75* | $1.740 |
* Opus 4.7 Adaptive figures shown adjusted by ~35% to reflect tokenizer inflation on equivalent content vs the other three models — the same prompt produces more tokens. Raw per-token cost matches GPT-5.5 at $5/M; effective bills run higher. See the pricing section in our flagship comparison post for the derivation.
Gemini's tier inflection at 200K is the most aggressive pricing cliff in the group. Below 200K, Gemini 3.1 Pro is the cheapest closed flagship — $0.10 to read 50K tokens, $0.40 at the threshold. Above 200K, the higher rate applies to the entire request, not just the overage: a 500K-token prompt costs $2.00 because all 500K tokens are billed at $4/M, not $1.60 from 200K at $2/M plus 300K at $4/M. The same prompt to DeepSeek V4 Pro costs $0.87 — roughly 2.3× cheaper than Gemini and 2.9× cheaper than GPT-5.5.
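The same mechanics in code, since the whole-request rate switch is the part teams most often model wrong. Rates are the list prices from the table above; this is a sketch, not a billing-accurate calculator:

```python
# Gemini 3.1 Pro input pricing: above 200K tokens, the higher rate applies to the
# whole request, not just the tokens over the threshold.

def gemini_input_cost(tokens: int) -> float:
    rate_per_m = 2.00 if tokens <= 200_000 else 4.00
    return tokens / 1_000_000 * rate_per_m


def flat_input_cost(tokens: int, rate_per_m: float) -> float:
    return tokens / 1_000_000 * rate_per_m


for n in (200_000, 200_001, 500_000):
    print(f"{n:>9,} tokens: Gemini ${gemini_input_cost(n):.2f}, DeepSeek ${flat_input_cost(n, 1.74):.2f}")
# 200,000 tokens reads for $0.40 on Gemini; one token more jumps the whole request to $0.80.
```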
At 1M, the spread widens. DeepSeek V4 Pro at $1.74 is a quarter of Opus 4.7 Adaptive's effective $6.75, roughly a third of GPT-5.5's $5.00, and well under half of Gemini's $4.00. A daily workload of ten 1M-token prompts costs $17.40 on DeepSeek and $67.50 on Opus — that's a $50/day delta on a single integration.
Caching changes the picture entirely when it applies. Opus 4.7 Adaptive's prompt-cache hits drop input from $5/M to ~$0.50/M (90% off), turning a $6.75 effective 1M prompt into roughly $0.68 on cached reads. DeepSeek V4 Pro's cache hits land at $0.145/M — a $1.74 prompt becomes $0.145, about 92% off. GPT-5.5 has prompt caching available with vendor-published discounts, plus Batch API at ~50% off list. Gemini does not currently advertise a flat prompt-cache discount; its tier pricing is the lever instead.
Caching applies cleanly to chatbot system prompts, RAG pipelines with stable corpora, and code-review agents that re-read the same files across queries. It does not apply to one-shot deep analysis of a fresh document, ad-hoc research synthesis, or any workload where each request carries unique payloads. If 80% of your traffic re-reads the same context, caching makes Opus the cheapest of the three closed flagships here (and DeepSeek's cache hits are cheaper still). If every request is fresh, DeepSeek V4 Pro is the cost leader outright.
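A sketch of how the cache-hit fraction decides the winner, using the list and cache-hit rates quoted above. The blend is simplified: it ignores cache-write surcharges, TTLs, and Opus's tokenizer inflation:

```python
# Blended input cost per million tokens as a function of how much traffic hits the cache.

RATES = {  # (list $/M input, cache-hit $/M input), April 2026 list prices
    "Claude Opus 4.7 Adaptive": (5.00, 0.50),
    "DeepSeek V4 Pro": (1.74, 0.145),
}


def blended_rate(model: str, cache_hit_fraction: float) -> float:
    list_rate, cache_rate = RATES[model]
    return cache_hit_fraction * cache_rate + (1 - cache_hit_fraction) * list_rate


for hit in (0.0, 0.5, 0.8, 0.95):
    opus = blended_rate("Claude Opus 4.7 Adaptive", hit)
    deepseek = blended_rate("DeepSeek V4 Pro", hit)
    print(f"{hit:.0%} cache hits: Opus ${opus:.2f}/M, DeepSeek ${deepseek:.2f}/M")
```

At 0% hits the two sit at $5.00 vs $1.74 per million; at 95% hits they converge toward roughly $0.73 vs $0.22.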
Output cost sits on top of all of this. The table above models input only. At list rates of $30/M (GPT-5.5), $25/M (Opus 4.7 Adaptive), $12 or $18/M (Gemini, depending on tier), and $3.48/M (DeepSeek V4 Pro), output can easily dominate total cost on generation-heavy workloads. Section 6 walks through where each profile actually wins.
Long context only matters relative to the workload. The four flagships diverge cleanly across four common patterns — document analysis, codebase review, long-form synthesis, and output-heavy generation. The right pick is rarely the same across all four.
Document analysis. Typical input range: 50K–300K tokens per document, single-pass read with a structured query (extract clauses, summarize findings, flag risks). Output is short — a few hundred to a few thousand tokens of structured response.
For inputs at or under 200K, Gemini 3.1 Pro is the cheapest closed flagship at $2/M with strong historical effective-context behavior. The math is straightforward: a 200K-token contract costs $0.40 to read on Gemini vs $1.00 on GPT-5.5 vs $1.35 on Opus (tokenizer-adjusted). Above 200K, Gemini's rate doubles to $4/M and DeepSeek V4 Pro at $1.74/M becomes the cost leader by a wide margin — a 300K-token filing runs $1.20 on Gemini (now in the higher tier) vs $0.52 on DeepSeek.
For document analysis under 200K tokens, use Gemini 3.1 Pro. Above 200K, DeepSeek V4 Pro takes over on cost.
Codebase review. Typical input range: 100K–500K for medium repos, 500K–1M for monorepos. The defining feature is repeated reads — the same code is queried many times across a session.
This is where Claude Opus 4.7 Adaptive wins on architecture, not on list price. Load the repo once and Opus's prompt cache makes every subsequent query land at ~$0.50/M effective input — a 500K-token monorepo costs roughly $0.34 per query after the first load (tokenizer-adjusted) instead of $3.38 raw. Opus 4.7 Adaptive's strong agentic-coding profile (87.6 SWE-bench Verified, see Post 1) compounds the value. For one-shot ingestion or non-interactive batch workflows where caching can't apply, DeepSeek V4 Pro at $0.87 for 500K is the cost leader.
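A rough session-cost comparison under stated assumptions: a 500K-token repo, Opus token counts inflated ~35% as in the table above, Opus cache hits at $0.50/M after one uncached load, and DeepSeek re-reading the full repo at its $1.74/M list rate on every query. Cache-write surcharges are ignored, and DeepSeek's own cache hits ($0.145/M) would lower its curve where they apply:

```python
# Interactive code-review session: one repo load, then N follow-up queries over the same context.

REPO_TOKENS = 500_000
OPUS_INFLATION = 1.35  # same content produces ~35% more tokens on Opus's tokenizer


def opus_session_cost(queries: int) -> float:
    tokens = REPO_TOKENS * OPUS_INFLATION
    first_load = tokens / 1e6 * 5.00              # uncached first read at list rate
    cached_reads = queries * tokens / 1e6 * 0.50  # subsequent reads at ~$0.50/M
    return first_load + cached_reads


def deepseek_session_cost(queries: int) -> float:
    # assumes every query re-reads the repo fresh at the $1.74/M list rate (no cache hits)
    return (queries + 1) * REPO_TOKENS / 1e6 * 1.74


for n in (1, 5, 20):
    print(f"{n:>2} queries: Opus ${opus_session_cost(n):.2f}, DeepSeek ${deepseek_session_cost(n):.2f}")
# At 1 query the uncached first load means DeepSeek is still cheaper; by ~5 queries
# the curves cross, and at 20 queries Opus (~$10) runs roughly half of DeepSeek (~$18).
```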
For interactive code review with stable context, Opus 4.7 Adaptive's caching is unbeatable. For batch ingestion, DeepSeek V4 Pro.
Long-form synthesis. Typical input range: 200K–800K tokens of source material, output 30K–100K tokens of synthesized analysis. Both volumes matter — input drives cost, output drives whether the work is even possible in one call.
When output volume is the binding constraint (>64K), DeepSeek V4 Pro is the only sub-$30/M flagship with the ceiling to generate it. When output stays under 128K and the workload demands the deepest knowledge recall on dense academic material, GPT-5.5 is the call — its HLE 52.2 (the highest of the four, see Post 1) is the relevant proxy for "how much of the long tail does this model actually know."
For long-form synthesis, DeepSeek V4 Pro if output volume drives the choice; GPT-5.5 if reasoning depth on dense academic material does.
Output-heavy generation. Typical range: 10K–50K input, 60K–384K output. The output ceiling is the binding constraint — input cost barely registers.
DeepSeek V4 Pro is the structural pick: 384K output at a $3.48/M output rate. A 100K-token draft costs $0.348 to write — call that the floor. The same draft on GPT-5.5 fits under the 128K output ceiling and costs $3.00. On Opus 4.7 Adaptive (effective $33.75/M with tokenizer inflation), the output tokens alone run ~$3.38, and a 100K-token draft overruns the 64K ceiling, so stitching overhead comes on top. On Gemini 3.1 Pro at 64K, you stitch as well.
For output-heavy generation, DeepSeek V4 Pro is the structural pick — every other flagship caps before the work is done.
Most production work that thinks it needs 1M tokens does not. Real-world traffic at scale tends to fit in 200K — chat sessions, document Q&A, code reviews of codebases up to roughly 20 KLOC (about 200K tokens at typical code densities), RAG queries where retrieval has already narrowed the corpus. A rough heuristic: 80% of LLM workloads in production today fit comfortably below 200K input tokens. The 1M ceiling matters for the long tail, and pricing models punish you for using it.
The cost-per-useful-token argument is the one most teams underweight. A 1M-token prompt where 90% of the context is noise is a $1.74–$6.75 prompt that could have been a $0.17–$0.68 prompt with retrieval. The model still has to attend over the noise; the bill is paid in full. RAG was not killed by the 1M context era — it was made selectively useful. The calculus shifted from "RAG vs long context" to "RAG when corpus is dynamic or multi-tenant; long context when the session is one-shot deep analysis."
When RAG still beats 1M windows: dynamic corpora that update faster than you can re-prompt, multi-tenant isolation where one user's context cannot leak into another's, and any cost-bound workload at scale. When 1M beats RAG: single-session deep analysis where the value is cross-document reasoning, code-wide refactors that need to see every caller of a function, and any workflow where the relevance of any single chunk depends on the rest of the corpus. The honest answer is "use both" — retrieval to narrow, long context to reason — and pick a model whose cost profile fits the dominant pattern.
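A sketch of that hybrid shape. `retrieve(query, k)` and `call_model(prompt)` are placeholders for your own vector index and API wrapper, not a specific library:

```python
# Hybrid pattern: retrieval narrows the corpus, the long window handles the cross-document reasoning.

def answer_with_hybrid_context(query: str, retrieve, call_model, k: int = 40) -> str:
    """retrieve(query, k) -> list[str] and call_model(prompt) -> str are your own stack."""
    chunks = retrieve(query, k)   # top-k chunks instead of the full multi-hundred-K corpus
    context = "\n\n".join(chunks)
    prompt = (
        f"Source material:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the source material and note which excerpt supports each claim."
    )
    return call_model(prompt)
```

Tune k against the cost-per-useful-token argument above: a larger k buys more cross-document reasoning at a higher per-request bill.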
We're preparing an interactive context window visualizer for BenchLM — drop in a sample input, see how each flagship's effective-context curve treats it, and compare cost at length across all four models in a single view. Until it ships, the cost-at-length and effective-context tables in this post are the closest static substitute. For live pages today, use the long-context ranking, large context window shortlist, and LLM pricing dashboard.
All four current frontier flagships — Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7 Adaptive, and DeepSeek V4 Pro — advertise 1M-token input windows. The largest advertised input is effectively a tie at the 1M-class ceiling. The more useful question is largest output window, where DeepSeek V4 Pro's 384K output ceiling is 3× GPT-5.5's 128K and 6× both Gemini 3.1 Pro and Opus 4.7 Adaptive at 64K. For single-call long generation, DeepSeek V4 Pro is the only flagship in a different tier.
A bigger window is not automatically better. Bigger windows cost more per request and degrade in effectiveness past the model's reliable-recall depth. Most production workloads fit under 200K tokens — chat, document Q&A, RAG queries — and never exercise the 1M ceiling. Filling >80% of an advertised window without first running a smoke test on your data is how teams discover effective-context degradation in production. Use the smallest window that fits your workload at the cheapest per-token rate; reserve the 1M ceiling for one-shot deep analysis where cross-document reasoning is the value.
A full 1M-token prompt, input-only, no caching, single call: $1.74 on DeepSeek V4 Pro, $4.00 on Gemini 3.1 Pro (in the >200K tier), $5.00 on GPT-5.5, and ~$6.75 on Opus 4.7 Adaptive after tokenizer inflation. Output cost is on top — $3.48/M (DeepSeek), $12 or $18/M (Gemini, by tier), $30/M (GPT-5.5), and $25/M (Opus). Caching changes everything: Opus prompt-cache hits drop input to ~$0.50/M (90% off), DeepSeek to $0.145/M (~92% off). For workloads with stable system prompts or shared context, cache-aware pricing is the metric that matters, not list price.
The best pick for long-document work depends on input size and what you do with the document. Under 200K tokens, Gemini 3.1 Pro is the cheapest closed flagship with strong historical effective-context behavior. Above 200K, DeepSeek V4 Pro at $1.74/M flat is the cost leader by a wide margin. For interactive code review where the same document is re-read many times, Claude Opus 4.7 Adaptive with prompt caching turns repeat reads into ~$0.50/M operations. For dense academic material where deepest knowledge recall matters, GPT-5.5 leads on HLE.
Advertised context is the maximum input the model API will accept — a vendor specification. Effective context is how much of that input the model can actually reason over without losing precision, measured by long-context benchmarks like NIAH (single-fact retrieval), LongBench v2 (multi-document QA), MRCRv2 (multi-round coreference), and RULER (synthetic probes). The two numbers are not the same. A model can advertise 1M tokens and still degrade past 200K-300K. The "lost in the middle" effect, attention dilution at long sequence lengths, and positional encoding limits all contribute. Run a smoke test on your data before committing to the upper third of an advertised window.
Other comparisons and tools on benchlm.ai:
For a head-to-head leaderboard view across all four flagships, the BenchLM model explorer tracks every benchmark referenced in this post as new scores publish.