Skip to main content
pricingdeepseekapicostguidebudget

DeepSeek API Pricing: deepseek-chat vs deepseek-reasoner (April 2026)

Current DeepSeek API pricing from the official docs: deepseek-chat and deepseek-reasoner, cache-hit vs cache-miss pricing, output pricing, and the current V3.2 endpoint mapping.

Glevd·Published April 13, 2026·12 min read

Share This Report

Copy the link, post it, or save a PDF version.

Share on XShare on LinkedIn

DeepSeek's pricing page is the simplest in the industry — two endpoints, one pricing table, three numbers. But those three numbers tell a story that changes how you should think about LLM cost optimization. At $0.028 per million input tokens on cache hits, DeepSeek makes input tokens essentially free. The real question becomes: what's the quality trade-off, and when does it matter?

This guide uses the current official DeepSeek pricing page, combined with benchmark data from BenchLM.ai and cross-provider pricing from sibling posts on Claude, OpenAI, and Gemini, to help you decide when DeepSeek's pricing makes it the right — and wrong — choice.

DeepSeek pricing — the simplest table in the industry

Endpoint Model Version Context Input Cache Hit $/M Input Cache Miss $/M Output $/M
deepseek-chat DeepSeek-V3.2 128K $0.028 $0.28 $0.42
deepseek-reasoner DeepSeek-V3.2 128K $0.028 $0.28 $0.42

Two endpoints. Same underlying model. Same price. The real cost split in DeepSeek's current pricing is not chat versus reasoner — it is cache hit versus cache miss, a 10x difference on input tokens.

Compare this to the pricing complexity at other providers. OpenAI publishes separate rates for GPT-5.4, GPT-5.4 nano, GPT-5.4 mini, o3, and o4-mini — each with different input, output, and reasoning token prices. Anthropic has three Claude tiers with different ratios. Gemini has context-length-dependent pricing tiers. DeepSeek has one table with three numbers. That simplicity is worth appreciating, even if the model isn't competing at the frontier.

Output pricing is flat at $0.42 per million tokens regardless of caching or endpoint choice. There are no separate reasoning token charges, no context-length surcharges, no batch pricing tiers. What you see is what you pay.

The $0.028 cache hit — why this number changes everything

This is the number that should reshape how you architect on DeepSeek. At $0.028 per million input tokens on a cache hit, a 2,000-token prompt costs $0.000056. That is $0.056 per thousand requests. Input becomes a rounding error.

To put that in perspective: sending a 2,000-token prompt on GPT-5.4 costs $0.005. On Claude Sonnet 4.6, $0.006. On DeepSeek with a cache hit, $0.000056. DeepSeek's cached input is roughly 90x cheaper than frontier model input.

This changes your prompt engineering strategy

The standard advice for expensive models is to minimize input tokens. Shorter system prompts, fewer few-shot examples, compressed context. Every token you add to a GPT-5.4 or Claude Opus request costs real money at scale.

On DeepSeek with caching, that logic inverts. Input tokens are so cheap that you should optimize for more context, not less. Longer system prompts with detailed instructions. More few-shot examples to demonstrate the exact output format you want. Richer context from your retrieval pipeline. The marginal cost of an extra 1,000 input tokens on a cache hit is $0.000028 — effectively zero. If adding those tokens improves output quality by even 1%, it's the best ROI in your entire stack.

Design your prompts for cache hits

DeepSeek's caching works on shared prefixes. The key design pattern: structure your prompts so the prefix — system prompt, few-shot examples, stable context — is as large and consistent as possible. The variable part — the user's actual query — goes at the end.

This means:

  • Your system prompt should be detailed and static across requests
  • Few-shot examples belong before the user query, not after
  • Any shared document context or retrieved knowledge should be placed in a stable position in the prompt
  • Only the final user message should vary between requests

Cache hit rate drives your real cost

The difference between a workload with 0% cache hits and 90% cache hits is enormous:

Cache Hit Rate Effective Input Cost per M Tokens
0% (all misses) $0.280
25% $0.217
50% $0.154
75% $0.091
90% $0.053
100% (all hits) $0.028

A well-designed application with consistent system prompts should achieve 75-90% cache hit rates. At 90%, your effective input cost is $0.053 per million tokens — less than a fifth of the already-cheap cache miss rate.

Real savings: the same workload, three scenarios

Assume 10,000 requests per day, each with 2,000 input tokens and 300 output tokens.

Scenario 1 — All cache misses on DeepSeek:

  • Daily input: 10,000 x 2,000 x $0.28 / 1M = $5.60
  • Daily output: 10,000 x 300 x $0.42 / 1M = $1.26
  • Monthly total: about $205.80

Scenario 2 — 90% cache hits on DeepSeek:

  • Daily input: (1,000 x 2,000 x $0.28 / 1M) + (9,000 x 2,000 x $0.028 / 1M) = $0.56 + $0.504 = $1.06
  • Daily output: 10,000 x 300 x $0.42 / 1M = $1.26
  • Monthly total: about $69.60

Scenario 3 — The same workload on GPT-5.4 (no caching):

  • Daily input: 10,000 x 2,000 x $2.50 / 1M = $50.00
  • Daily output: 10,000 x 300 x $15.00 / 1M = $45.00
  • Monthly total: about $2,850.00

Same request volume. DeepSeek with caching costs $69.60/month. GPT-5.4 costs $2,850/month. That's a 41x cost difference. Even DeepSeek without caching ($205.80) is nearly 14x cheaper than GPT-5.4.

Use the cost calculator to model your own workload, or the token counter to estimate token counts from your actual prompts.

Chat vs Reasoner — same price, different behavior

Both endpoints currently map to DeepSeek-V3.2 and cost exactly the same per token. The choice between them is about capability and behavior, not price.

deepseek-chat

  • Non-thinking mode
  • Default max output: 4K / Maximum output: 8K
  • Supports JSON output, tool calls, chat prefix completion (beta)
  • Supports FIM completion (beta) — the only endpoint with fill-in-the-middle

deepseek-reasoner

  • Thinking mode — generates chain-of-thought before the final answer
  • Default max output: 32K / Maximum output: 64K
  • Supports JSON output, tool calls, chat prefix completion (beta)
  • Does not support FIM completion

When to use which

Use deepseek-chat for the majority of workloads: general Q&A, content generation, code completion, classification, extraction, and any task where a direct answer is sufficient. It's faster because it doesn't generate thinking tokens, and the 4-8K output cap is enough for most use cases.

Use deepseek-reasoner when the task benefits from explicit chain-of-thought: multi-step math, logic puzzles, complex analysis, and problems where showing the work improves accuracy. The 32-64K output cap also matters — if your task requires long-form generation beyond 8K tokens, reasoner is your only option.

The hidden cost of thinking tokens

One detail to watch: reasoner generates thinking tokens that count toward output cost. A reasoning request might produce 5,000 thinking tokens plus 500 visible output tokens — that's 5,500 output tokens billed at $0.42/M, costing $0.0023 per request.

At DeepSeek's prices, this is still dirt cheap. The same kind of reasoning on o3 or Claude Opus would cost 50-100x more. But if you're running reasoner on millions of requests, the thinking token multiplier adds up. Monitor your actual output token counts, not just the visible response length.

Benchmark-adjusted value — the quality trade-off nobody should ignore

Here's where the pricing story gets complicated. DeepSeek is extraordinarily cheap — but cheap tokens that produce wrong answers aren't saving you money. They're costing you rework.

Model BenchLM Score Input $/M (cache miss) Output $/M Score per dollar (output)
DeepSeek V3.2 (chat) 62 $0.28 $0.42 148
GPT-5.4 nano 49 $0.20 $1.25 39.2
Gemini 3.1 Flash-Lite 54 $0.25 $1.50 36.0
GPT-5.4 84 $2.50 $15.00 5.6
Claude Opus 4.6 80 $5.00 $25.00 3.2

BenchLM overall scores from BenchLM.ai. Prices per million tokens.

On raw benchmark-points-per-dollar, DeepSeek wins by an absurd margin. At roughly 148 points per output dollar, it delivers vastly more benchmark score per dollar than frontier-priced models.

But a BenchLM score of 62 versus 84 isn't a minor gap — it's still a fundamentally different quality tier. Here's what that gap means in practice:

Where DeepSeek is good enough

  • General Q&A and chatbots — conversational tasks where approximate answers are acceptable and a human can spot-check
  • Content drafts for internal use — summaries, notes, brainstorming, first drafts that will be edited
  • Code completion and simple generation — boilerplate, repetitive patterns, straightforward implementations
  • Classification and extraction — tagging, routing, pulling structured data from unstructured text
  • High-volume preprocessing — any pipeline step where you process thousands of items and the occasional error is tolerable

Where the quality gap bites

  • Hard reasoning tasks — DeepSeek misses problems that frontier models solve correctly. If your task involves multi-step logical inference or mathematical reasoning, the error rate is measurably higher.
  • Complex instruction following — frontier models like Claude Opus (Arena IF: 1500) and GPT-5.4 (Arena IF: 1470) are significantly more reliable at following detailed, multi-constraint instructions. DeepSeek is more likely to ignore constraints or produce partially compliant output.
  • Agentic workflows — long tool-use chains where each step depends on the previous one. Errors compound, and a model that's 90% accurate per step becomes 35% accurate over 10 steps. Frontier models' higher per-step accuracy matters exponentially in these chains.
  • Safety-critical output — legal analysis, medical information, compliance documentation, anything where a wrong answer has real consequences.

The honest assessment: DeepSeek is excellent for tasks where "good enough" is good enough. It is not the right choice when errors are expensive. The 41x cost savings only matter if the output is actually usable — test on your specific task before committing.

Building reliable systems on DeepSeek

DeepSeek has experienced outages during high-demand periods. If you're building production systems on DeepSeek, you need a fallback architecture — not because DeepSeek is unreliable by default, but because any cost-optimized system should handle provider downtime gracefully.

The fallback pattern

Primary: DeepSeek. Secondary: a cheap model from a provider with high uptime guarantees. The two natural choices:

  • GPT-5.4 nano at $0.20/$1.25 — OpenAI's infrastructure reliability, reasonable quality at low cost
  • Gemini 3.1 Flash-Lite at $0.25/$1.50 — Google's infrastructure, competitive pricing

Blended cost with fallback

Assume 10% of your requests hit the fallback model due to DeepSeek outages or rate limits:

  • 90% on DeepSeek (cache miss): $0.28 input / $0.42 output
  • 10% on GPT-5.4 nano: $0.20 input / $1.25 output
  • Blended rate: $0.272 input / $0.503 output

That blended output rate of $0.503/M is still 30x cheaper than GPT-5.4's $15/M and 50x cheaper than Claude Opus's $25/M. The cost of reliability insurance is negligible compared to your savings.

Implementation

Libraries like LiteLLM make automatic failover straightforward. Define DeepSeek as your primary model, set a timeout threshold, and configure one or two fallback models. The routing logic adds minimal latency on the happy path and saves you from building manual retry logic.

The key design principle: your fallback model should be cheap enough that you never hesitate to use it, and reliable enough that it doesn't need its own fallback. GPT-5.4 nano and Gemini Flash-Lite both fit this profile.

The self-hosting question

DeepSeek models are open-weight — you can download and run them on your own infrastructure for free. This is a genuine advantage that no closed-source provider offers. But at DeepSeek's current API prices, the economics of self-hosting are unusual.

The break-even math

A single A100 GPU rents for roughly $1-3/hour from cloud providers. At $2/hour, that's $1,440/month in GPU costs before any engineering time, networking, or storage.

To break even against the API at cache-miss rates, you'd need to process enough tokens to accumulate $1,440 in API fees. At $0.28/M input tokens and 2,000 tokens per request, that's roughly 2.5 million requests per month — about 83,000 per day.

At cache-hit rates ($0.028/M), the break-even is even higher: you'd need 25 million requests per month just on input costs to justify the GPU rental.

For most teams, the API is cheaper than self-hosting unless you're operating at genuine scale — tens of thousands of requests per day sustained.

When to self-host anyway

The decision to self-host isn't always about cost:

  • Data residency requirements — if your data cannot leave your infrastructure for regulatory or compliance reasons, self-hosting is the only option
  • Latency guarantees — self-hosted models eliminate network round-trips and API queue times, giving you predictable low-latency inference
  • Customization — fine-tuning, custom tokenizers, or model modifications require running your own infrastructure
  • No rate limits — the API has throughput limits during peak demand; self-hosted inference scales with your hardware

The middle ground

If you need DeepSeek in a specific geographic region but don't want to manage GPU infrastructure, third-party inference providers like Together AI and Fireworks host DeepSeek models in US and EU data centers. Their prices are higher than DeepSeek's own API — typically $0.50-1.50/M for input — but still far cheaper than frontier models, and you get the data residency and reliability of established cloud providers.

When DeepSeek isn't the right choice

An honest pricing guide should tell you when to spend more. DeepSeek's cost advantage is real, but there are workloads where choosing the cheapest model is a false economy.

When quality gaps have consequences

If wrong answers cost more than the savings — legal analysis, medical triage, financial compliance, safety-critical systems — the gap between a BenchLM score of 62 and 84 translates directly to error rates you can't afford. Spend the extra money on GPT-5.4 or Claude Opus 4.6 for these tasks.

When instruction following precision matters

Claude Opus 4.6 leads the industry on instruction following with an Arena IF score of 1500. If your workflow depends on the model respecting complex formatting constraints, multi-step instructions, or brand voice guidelines, DeepSeek will require more prompt iteration and produce more non-compliant outputs. The debugging time can exceed the cost savings.

When you need the largest context window

DeepSeek's 128K context window is generous, but Gemini 3.1 Pro offers 1M tokens — nearly 8x more. For workloads involving full codebases, long legal documents, or book-length analysis in a single pass, Gemini's context advantage is worth the higher price.

When reliability is non-negotiable

If your application has strict uptime SLAs that preclude any provider downtime, you need a provider with formal SLA guarantees. DeepSeek's API has historically experienced periods of degraded performance during high demand. Building a fallback architecture mitigates this, but if even brief outages are unacceptable, a primary deployment on OpenAI or Google's infrastructure is safer.

When data residency matters and you can't self-host

DeepSeek's API routes through infrastructure subject to Chinese data handling regulations. For enterprises with strict data sovereignty requirements — EU GDPR, US government, healthcare — this may be a non-starter unless you self-host or use a third-party inference provider in your required jurisdiction.

The practical takeaway

DeepSeek's pricing makes it the default choice for cost-sensitive workloads where quality is "good enough" — and that covers a surprisingly large share of real-world LLM use cases.

Four rules for getting the most out of DeepSeek:

  1. Design for cache hits. Long, stable system prompts at the beginning. Variable user queries at the end. Aim for 75%+ cache hit rates. At 90% cache hits, your effective input cost drops to $0.053/M — essentially free.

  2. Build with fallbacks. DeepSeek primary, GPT-5.4 nano or Gemini Flash-Lite secondary. Your blended cost stays far below frontier pricing, and you gain reliability insurance that costs almost nothing.

  3. Choose chat vs reasoner by behavior, not price. They cost the same. Use chat for speed and simplicity, reasoner for tasks that benefit from chain-of-thought. Watch thinking token costs on reasoner if you're running high volume.

  4. Test quality on YOUR task before committing. A BenchLM score of 62 still means some workloads will show a meaningful gap versus frontier models. Run 50 representative prompts through both DeepSeek and a stronger alternative. If the outputs are equivalent for your use case, the savings are real. If they're not, no amount of caching makes up for wrong answers.

DeepSeek is not trying to be the best model. It's trying to be the model where the price-to-quality ratio is so extreme that you can't justify using anything else for the bottom 60% of your workload. On that metric, nothing else comes close.

For a broader vendor comparison, see the LLM pricing overview. For provider-specific deep dives: Claude pricing, OpenAI pricing, Gemini pricing. Use the cost calculator to model your workload, the token counter to estimate token volumes from your prompts, or read how token pricing works for a primer on LLM cost mechanics.

Pricing from DeepSeek's official pricing page. Benchmark scores from BenchLM.ai. Arena scores from arena.ai. Current as of April 2026.

Model pricing changes frequently. We send one email a week with what moved and why.