Current DeepSeek API pricing from the official docs: deepseek-chat and deepseek-reasoner, cache-hit vs cache-miss pricing, output pricing, and the current V3.2 endpoint mapping.
Share This Report
Copy the link, post it, or save a PDF version.
DeepSeek's pricing page is the simplest in the industry — two endpoints, one pricing table, three numbers. But those three numbers tell a story that changes how you should think about LLM cost optimization. At $0.028 per million input tokens on cache hits, DeepSeek makes input tokens essentially free. The real question becomes: what's the quality trade-off, and when does it matter?
This guide uses the current official DeepSeek pricing page, combined with benchmark data from BenchLM.ai and cross-provider pricing from sibling posts on Claude, OpenAI, and Gemini, to help you decide when DeepSeek's pricing makes it the right — and wrong — choice.
| Endpoint | Model Version | Context | Input Cache Hit $/M | Input Cache Miss $/M | Output $/M |
|---|---|---|---|---|---|
deepseek-chat |
DeepSeek-V3.2 | 128K | $0.028 | $0.28 | $0.42 |
deepseek-reasoner |
DeepSeek-V3.2 | 128K | $0.028 | $0.28 | $0.42 |
Two endpoints. Same underlying model. Same price. The real cost split in DeepSeek's current pricing is not chat versus reasoner — it is cache hit versus cache miss, a 10x difference on input tokens.
Compare this to the pricing complexity at other providers. OpenAI publishes separate rates for GPT-5.4, GPT-5.4 nano, GPT-5.4 mini, o3, and o4-mini — each with different input, output, and reasoning token prices. Anthropic has three Claude tiers with different ratios. Gemini has context-length-dependent pricing tiers. DeepSeek has one table with three numbers. That simplicity is worth appreciating, even if the model isn't competing at the frontier.
Output pricing is flat at $0.42 per million tokens regardless of caching or endpoint choice. There are no separate reasoning token charges, no context-length surcharges, no batch pricing tiers. What you see is what you pay.
This is the number that should reshape how you architect on DeepSeek. At $0.028 per million input tokens on a cache hit, a 2,000-token prompt costs $0.000056. That is $0.056 per thousand requests. Input becomes a rounding error.
To put that in perspective: sending a 2,000-token prompt on GPT-5.4 costs $0.005. On Claude Sonnet 4.6, $0.006. On DeepSeek with a cache hit, $0.000056. DeepSeek's cached input is roughly 90x cheaper than frontier model input.
The standard advice for expensive models is to minimize input tokens. Shorter system prompts, fewer few-shot examples, compressed context. Every token you add to a GPT-5.4 or Claude Opus request costs real money at scale.
On DeepSeek with caching, that logic inverts. Input tokens are so cheap that you should optimize for more context, not less. Longer system prompts with detailed instructions. More few-shot examples to demonstrate the exact output format you want. Richer context from your retrieval pipeline. The marginal cost of an extra 1,000 input tokens on a cache hit is $0.000028 — effectively zero. If adding those tokens improves output quality by even 1%, it's the best ROI in your entire stack.
DeepSeek's caching works on shared prefixes. The key design pattern: structure your prompts so the prefix — system prompt, few-shot examples, stable context — is as large and consistent as possible. The variable part — the user's actual query — goes at the end.
This means:
The difference between a workload with 0% cache hits and 90% cache hits is enormous:
| Cache Hit Rate | Effective Input Cost per M Tokens |
|---|---|
| 0% (all misses) | $0.280 |
| 25% | $0.217 |
| 50% | $0.154 |
| 75% | $0.091 |
| 90% | $0.053 |
| 100% (all hits) | $0.028 |
A well-designed application with consistent system prompts should achieve 75-90% cache hit rates. At 90%, your effective input cost is $0.053 per million tokens — less than a fifth of the already-cheap cache miss rate.
Assume 10,000 requests per day, each with 2,000 input tokens and 300 output tokens.
Scenario 1 — All cache misses on DeepSeek:
Scenario 2 — 90% cache hits on DeepSeek:
Scenario 3 — The same workload on GPT-5.4 (no caching):
Same request volume. DeepSeek with caching costs $69.60/month. GPT-5.4 costs $2,850/month. That's a 41x cost difference. Even DeepSeek without caching ($205.80) is nearly 14x cheaper than GPT-5.4.
Use the cost calculator to model your own workload, or the token counter to estimate token counts from your actual prompts.
Both endpoints currently map to DeepSeek-V3.2 and cost exactly the same per token. The choice between them is about capability and behavior, not price.
deepseek-chatdeepseek-reasonerUse deepseek-chat for the majority of workloads: general Q&A, content generation, code completion, classification, extraction, and any task where a direct answer is sufficient. It's faster because it doesn't generate thinking tokens, and the 4-8K output cap is enough for most use cases.
Use deepseek-reasoner when the task benefits from explicit chain-of-thought: multi-step math, logic puzzles, complex analysis, and problems where showing the work improves accuracy. The 32-64K output cap also matters — if your task requires long-form generation beyond 8K tokens, reasoner is your only option.
One detail to watch: reasoner generates thinking tokens that count toward output cost. A reasoning request might produce 5,000 thinking tokens plus 500 visible output tokens — that's 5,500 output tokens billed at $0.42/M, costing $0.0023 per request.
At DeepSeek's prices, this is still dirt cheap. The same kind of reasoning on o3 or Claude Opus would cost 50-100x more. But if you're running reasoner on millions of requests, the thinking token multiplier adds up. Monitor your actual output token counts, not just the visible response length.
Here's where the pricing story gets complicated. DeepSeek is extraordinarily cheap — but cheap tokens that produce wrong answers aren't saving you money. They're costing you rework.
| Model | BenchLM Score | Input $/M (cache miss) | Output $/M | Score per dollar (output) |
|---|---|---|---|---|
| DeepSeek V3.2 (chat) | 62 | $0.28 | $0.42 | 148 |
| GPT-5.4 nano | 49 | $0.20 | $1.25 | 39.2 |
| Gemini 3.1 Flash-Lite | 54 | $0.25 | $1.50 | 36.0 |
| GPT-5.4 | 84 | $2.50 | $15.00 | 5.6 |
| Claude Opus 4.6 | 80 | $5.00 | $25.00 | 3.2 |
BenchLM overall scores from BenchLM.ai. Prices per million tokens.
On raw benchmark-points-per-dollar, DeepSeek wins by an absurd margin. At roughly 148 points per output dollar, it delivers vastly more benchmark score per dollar than frontier-priced models.
But a BenchLM score of 62 versus 84 isn't a minor gap — it's still a fundamentally different quality tier. Here's what that gap means in practice:
The honest assessment: DeepSeek is excellent for tasks where "good enough" is good enough. It is not the right choice when errors are expensive. The 41x cost savings only matter if the output is actually usable — test on your specific task before committing.
DeepSeek has experienced outages during high-demand periods. If you're building production systems on DeepSeek, you need a fallback architecture — not because DeepSeek is unreliable by default, but because any cost-optimized system should handle provider downtime gracefully.
Primary: DeepSeek. Secondary: a cheap model from a provider with high uptime guarantees. The two natural choices:
Assume 10% of your requests hit the fallback model due to DeepSeek outages or rate limits:
That blended output rate of $0.503/M is still 30x cheaper than GPT-5.4's $15/M and 50x cheaper than Claude Opus's $25/M. The cost of reliability insurance is negligible compared to your savings.
Libraries like LiteLLM make automatic failover straightforward. Define DeepSeek as your primary model, set a timeout threshold, and configure one or two fallback models. The routing logic adds minimal latency on the happy path and saves you from building manual retry logic.
The key design principle: your fallback model should be cheap enough that you never hesitate to use it, and reliable enough that it doesn't need its own fallback. GPT-5.4 nano and Gemini Flash-Lite both fit this profile.
DeepSeek models are open-weight — you can download and run them on your own infrastructure for free. This is a genuine advantage that no closed-source provider offers. But at DeepSeek's current API prices, the economics of self-hosting are unusual.
A single A100 GPU rents for roughly $1-3/hour from cloud providers. At $2/hour, that's $1,440/month in GPU costs before any engineering time, networking, or storage.
To break even against the API at cache-miss rates, you'd need to process enough tokens to accumulate $1,440 in API fees. At $0.28/M input tokens and 2,000 tokens per request, that's roughly 2.5 million requests per month — about 83,000 per day.
At cache-hit rates ($0.028/M), the break-even is even higher: you'd need 25 million requests per month just on input costs to justify the GPU rental.
For most teams, the API is cheaper than self-hosting unless you're operating at genuine scale — tens of thousands of requests per day sustained.
The decision to self-host isn't always about cost:
If you need DeepSeek in a specific geographic region but don't want to manage GPU infrastructure, third-party inference providers like Together AI and Fireworks host DeepSeek models in US and EU data centers. Their prices are higher than DeepSeek's own API — typically $0.50-1.50/M for input — but still far cheaper than frontier models, and you get the data residency and reliability of established cloud providers.
An honest pricing guide should tell you when to spend more. DeepSeek's cost advantage is real, but there are workloads where choosing the cheapest model is a false economy.
If wrong answers cost more than the savings — legal analysis, medical triage, financial compliance, safety-critical systems — the gap between a BenchLM score of 62 and 84 translates directly to error rates you can't afford. Spend the extra money on GPT-5.4 or Claude Opus 4.6 for these tasks.
Claude Opus 4.6 leads the industry on instruction following with an Arena IF score of 1500. If your workflow depends on the model respecting complex formatting constraints, multi-step instructions, or brand voice guidelines, DeepSeek will require more prompt iteration and produce more non-compliant outputs. The debugging time can exceed the cost savings.
DeepSeek's 128K context window is generous, but Gemini 3.1 Pro offers 1M tokens — nearly 8x more. For workloads involving full codebases, long legal documents, or book-length analysis in a single pass, Gemini's context advantage is worth the higher price.
If your application has strict uptime SLAs that preclude any provider downtime, you need a provider with formal SLA guarantees. DeepSeek's API has historically experienced periods of degraded performance during high demand. Building a fallback architecture mitigates this, but if even brief outages are unacceptable, a primary deployment on OpenAI or Google's infrastructure is safer.
DeepSeek's API routes through infrastructure subject to Chinese data handling regulations. For enterprises with strict data sovereignty requirements — EU GDPR, US government, healthcare — this may be a non-starter unless you self-host or use a third-party inference provider in your required jurisdiction.
DeepSeek's pricing makes it the default choice for cost-sensitive workloads where quality is "good enough" — and that covers a surprisingly large share of real-world LLM use cases.
Four rules for getting the most out of DeepSeek:
Design for cache hits. Long, stable system prompts at the beginning. Variable user queries at the end. Aim for 75%+ cache hit rates. At 90% cache hits, your effective input cost drops to $0.053/M — essentially free.
Build with fallbacks. DeepSeek primary, GPT-5.4 nano or Gemini Flash-Lite secondary. Your blended cost stays far below frontier pricing, and you gain reliability insurance that costs almost nothing.
Choose chat vs reasoner by behavior, not price. They cost the same. Use chat for speed and simplicity, reasoner for tasks that benefit from chain-of-thought. Watch thinking token costs on reasoner if you're running high volume.
Test quality on YOUR task before committing. A BenchLM score of 62 still means some workloads will show a meaningful gap versus frontier models. Run 50 representative prompts through both DeepSeek and a stronger alternative. If the outputs are equivalent for your use case, the savings are real. If they're not, no amount of caching makes up for wrong answers.
DeepSeek is not trying to be the best model. It's trying to be the model where the price-to-quality ratio is so extreme that you can't justify using anything else for the bottom 60% of your workload. On that metric, nothing else comes close.
For a broader vendor comparison, see the LLM pricing overview. For provider-specific deep dives: Claude pricing, OpenAI pricing, Gemini pricing. Use the cost calculator to model your workload, the token counter to estimate token volumes from your prompts, or read how token pricing works for a primer on LLM cost mechanics.
Pricing from DeepSeek's official pricing page. Benchmark scores from BenchLM.ai. Arena scores from arena.ai. Current as of April 2026.
Model pricing changes frequently. We send one email a week with what moved and why.
Share This Report
Copy the link, post it, or save a PDF version.
On this page
Which models moved up, what’s new, and what it costs. One email a week, 3-min read.
Free. One email per week.
Current Anthropic Claude API pricing from official model pages and the Claude Opus 4.7 launch announcement, including prompt caching, batch discounts, and current long-context notes.
Current Gemini API pricing from Google's official docs: 3.1 Pro Preview, 3.1 Flash-Lite Preview, 3 Flash Preview, 2.5 Flash, 2.5 Pro, plus Batch and Flex pricing.
Current OpenAI API pricing from official docs: GPT-5.4, GPT-5.2, GPT-5.1, cached input rates, Batch API discounts, and the pricing details that actually matter.