LLM APIs charge per token — typically $0.05 to $75 per million tokens depending on the model. A token is roughly 4 characters or 0.75 words. Here's exactly how pricing works, what drives the cost differences between models, and how to estimate and optimize your spend.
Want to check token counts right now? Try our free LLM token counter — paste any text and see counts across GPT-5, Claude, Gemini, and more.
A token is the basic unit of text that language models process. Rather than reading character-by-character or word-by-word, LLMs use tokenizers that split text into subword pieces.
Most modern LLMs use Byte-Pair Encoding (BPE), a tokenization scheme that learns to merge frequently co-occurring character sequences into single tokens. The practical result is a simple rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.
Different models use different tokenizers, so the same text produces different token counts. The differences are usually within 5-10% for English text but can be larger for code, non-Latin scripts, or specialized terminology.
You can check exact counts with our LLM token counter, which uses real tokenizers for OpenAI models and calibrated estimates for others.
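The rule of thumb above can be turned into a quick estimator. This is a heuristic sketch, not a real tokenizer, so expect errors of 5-10% or more depending on the text:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the 1 token ~= 4 characters heuristic.

    Real BPE tokenizers will differ, especially for code or
    non-Latin scripts -- use a real tokenizer for exact counts.
    """
    return max(1, round(len(text) / 4))


def estimate_tokens_by_words(text: str) -> int:
    """Alternative estimate: 1 token ~= 0.75 words, i.e. words / 0.75."""
    words = len(text.split())
    return max(1, round(words / 0.75))


sample = "Language models process text as tokens, not characters."
print(estimate_tokens(sample))           # character-based estimate
print(estimate_tokens_by_words(sample))  # word-based estimate
```

When the two estimates disagree badly, the text probably contains code, punctuation-heavy content, or a non-Latin script, which is exactly when you should reach for a real tokenizer.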
Every LLM API charges separately for input tokens (your prompt, system message, and any context) and output tokens (the model's response). Output tokens always cost more, typically 4-8x the input price.
Why? The input prompt is processed in parallel in a single forward pass through the model. Output tokens require autoregressive generation: the model predicts one token at a time, and each new token needs its own forward pass with a full probability calculation across the vocabulary. That sequential work cannot be parallelized, which is why providers price it higher.
Here's what this looks like across major models:
| Model | Input $/M | Output $/M | Output/Input Ratio |
|---|---|---|---|
| GPT-5 nano | $0.05 | $0.40 | 8x |
| DeepSeek V3 | $0.27 | $1.10 | 4.1x |
| Gemini 3.1 Pro | $1.25 | $5.00 | 4x |
| GPT-5.4 | $2.50 | $15.00 | 6x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x |
| Claude Opus 4.6 | $15.00 | $75.00 | 5x |
| GPT-5.4 Pro | $30.00 | $180.00 | 6x |
For the complete breakdown of every model, see our LLM pricing comparison table.
The practical implication: your prompt design matters more than model choice for cost control. Reducing output length (with max_tokens or explicit instructions) often saves more than switching models, because output tokens are the expensive part.
Reasoning models — o3, o4-mini, DeepSeek R1, and similar "chain-of-thought" models — introduce a cost multiplier that surprises many developers.
These models generate internal reasoning tokens (sometimes called "thinking tokens") to work through intermediate steps before producing an answer. These tokens are billed as output tokens but may never appear in the visible response, and the ratio of hidden reasoning tokens to visible output can be extreme.
In short: use reasoning models when multi-step accuracy justifies the extra tokens (hard math, complex debugging, multi-step planning), and avoid them for simple classification, extraction, or formatting, where the hidden reasoning tokens add cost without improving results.
For a side-by-side comparison of reasoning vs. non-reasoning models, check our reasoning model rankings and non-reasoning rankings.
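The billing effect is easy to model. Here is a sketch with a hypothetical `reasoning_ratio` of hidden tokens per visible token; the real ratio varies by model and task:

```python
def reasoning_cost(visible_output_tokens: int,
                   reasoning_ratio: float,
                   output_price_per_m: float) -> float:
    """Cost of a reasoning-model response.

    Hidden reasoning tokens are billed at the output rate even though
    they never appear in the response. reasoning_ratio is hidden
    tokens per visible token (hypothetical -- varies by model/task).
    """
    billed_tokens = visible_output_tokens * (1 + reasoning_ratio)
    return billed_tokens * output_price_per_m / 1_000_000


# A 500-token visible answer with 5x hidden reasoning at $15/M output:
# 3,000 billed tokens instead of 500.
print(f"${reasoning_cost(500, 5.0, 15.00):.4f}")  # $0.0450
```

Note that the same answer without reasoning would cost $0.0075, so the hidden tokens multiplied the bill by 6x here.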
The basic formula:
Monthly cost = (avg input tokens × input price / 1M + avg output tokens × output price / 1M) × requests per month
Let's say you're building a chatbot that handles 1,000 conversations per day (about 30,000 per month), each averaging roughly 2,000 input and 1,000 output tokens.
With GPT-5.4 ($2.50 input / $15.00 output per M tokens): about $0.02 per conversation, or roughly $600/month.
With DeepSeek V3 ($0.27 input / $1.10 output per M tokens): about $0.0016 per conversation, or roughly $49/month.
That's roughly a 12x cost difference for the same workload. Whether the quality difference justifies it depends on your use case.
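The formula is mechanical enough to script. The per-conversation averages below (2,000 input / 1,000 output tokens) are hypothetical numbers chosen for illustration:

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 requests_per_month: int) -> float:
    """Monthly cost = per-request cost x request volume."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_request * requests_per_month


# Hypothetical chatbot: 1,000 conversations/day -> ~30,000/month,
# averaging 2,000 input and 1,000 output tokens each.
gpt = monthly_cost(2_000, 1_000, 2.50, 15.00, 30_000)      # $600.00
deepseek = monthly_cost(2_000, 1_000, 0.27, 1.10, 30_000)  # $49.20
print(f"GPT-5.4: ${gpt:.2f}/mo, DeepSeek V3: ${deepseek:.2f}/mo "
      f"({gpt / deepseek:.1f}x)")
```

Swapping in your own measured token averages, rather than guesses, is the single biggest accuracy improvement you can make to this estimate.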
For quick estimates without manual math, use our LLM cost calculator or AI cost calculator which handles blog posts, docs, and feature development estimates.
Both OpenAI and Anthropic offer prompt caching that stores the computed state for repeated prompt prefixes (OpenAI applies it automatically; Anthropic requires marking cacheable prefixes). When the same prefix is reused, e.g. a long system prompt, cached input tokens can cost as little as ~10% of the normal price, with exact discounts varying by provider.
If your application uses a consistent system prompt or frequently includes the same context (like documentation or user history), caching can cut input costs by up to 90% on those cached portions.
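A sketch of the savings, assuming cached tokens bill at 10% of the normal input rate; the exact discount and minimum-prefix rules vary by provider:

```python
def input_cost_with_cache(cached_prefix_tokens: int,
                          fresh_tokens: int,
                          input_price_per_m: float,
                          cached_discount: float = 0.10) -> float:
    """Input cost when a prompt prefix is served from the cache.

    cached_discount is the fraction of the normal input price charged
    for cached tokens (assumed ~10% here; check your provider's pricing).
    """
    cached_cost = cached_prefix_tokens * input_price_per_m * cached_discount
    fresh_cost = fresh_tokens * input_price_per_m
    return (cached_cost + fresh_cost) / 1_000_000


# 8,000-token system prompt (cached) + 500 fresh tokens at $3.00/M input:
cold = (8_000 + 500) * 3.00 / 1_000_000          # no cache: $0.0255
warm = input_cost_with_cache(8_000, 500, 3.00)   # cache hit: $0.0039
print(f"{1 - warm / cold:.0%} saved")            # 85% saved
```

The bigger the shared prefix relative to the fresh tail, the closer the saving approaches the full 90%.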
OpenAI and Anthropic offer batch APIs with a 50% discount for non-real-time workloads. If your tasks don't need immediate responses — bulk classification, content generation, data extraction — batch processing halves your costs with no quality difference.
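Since the discount is a flat multiplier, the decision is purely about latency tolerance. A trivial sketch, assuming the 50% batch rate:

```python
BATCH_DISCOUNT = 0.50  # assumed flat 50% off for batch-API jobs


def batch_cost(realtime_cost: float) -> float:
    """Same workload through the batch API at half price."""
    return realtime_cost * BATCH_DISCOUNT


# 100,000 overnight classification calls that would cost $40 in real time:
print(f"${batch_cost(40.00):.2f}")  # $20.00
```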
Not every request needs a frontier model. A model router sends simple tasks (classification, extraction, formatting) to cheap fast models (GPT-5 nano, Gemini 3.1 Flash-Lite) and complex tasks (reasoning, creative writing, code generation) to frontier models.
This approach can cut overall costs by 60-80% while maintaining quality where it matters. See our best budget LLMs guide for recommended models at each price tier.
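A minimal router sketch. The model names and prices come from the table above, but the task labels and tier mapping here are illustrative placeholders; production routers usually classify requests with a cheap model or heuristics rather than a fixed lookup:

```python
# Hypothetical price table (input $/M, output $/M) -- check live pricing.
MODELS = {
    "cheap":    {"name": "GPT-5 nano", "in": 0.05, "out": 0.40},
    "frontier": {"name": "GPT-5.4",    "in": 2.50, "out": 15.00},
}

# Task labels are illustrative; a real router would classify requests.
SIMPLE_TASKS = {"classification", "extraction", "formatting"}


def route(task_type: str) -> dict:
    """Send simple tasks to a cheap model, everything else to a frontier one."""
    tier = "cheap" if task_type in SIMPLE_TASKS else "frontier"
    return MODELS[tier]


print(route("classification")["name"])   # GPT-5 nano
print(route("code generation")["name"])  # GPT-5.4
```

Even this crude two-tier split captures most of the saving: if 70% of traffic is simple, the blended input rate drops from $2.50/M to about $0.79/M.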
Shorter prompts = fewer input tokens = lower cost. Common wins: trimming redundant instructions, cutting few-shot examples down to the minimum that holds quality, and summarizing long context instead of pasting it verbatim.
Set max_tokens to cap the model's response length. Without it, models may generate unnecessarily verbose responses. For structured output (JSON, classifications), explicit length constraints can reduce output tokens by 50-80%.
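The saving is easy to quantify. A sketch comparing a verbose uncapped response with a constrained one, using illustrative token counts at a $15/M output rate:

```python
def output_cost(tokens: int, output_price_per_m: float) -> float:
    """Dollar cost of a response of the given length."""
    return tokens * output_price_per_m / 1_000_000


# Uncapped chatty JSON answer vs. one constrained via max_tokens and
# explicit length instructions (token counts are illustrative).
verbose = output_cost(800, 15.00)      # $0.0120
constrained = output_cost(200, 15.00)  # $0.0030
print(f"{1 - constrained / verbose:.0%} fewer output dollars")  # 75%
```

Because output is the expensive side of the ledger, a 75% cut in response length often outweighs an entire model downgrade on the input side.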
Models like Llama 4 Maverick, Qwen3.5 397B, and DeepSeek V3 publish open weights, so self-hosting carries no per-token fees. At high volumes (millions of requests/month), self-hosting can be dramatically cheaper than API pricing, though you need to factor in GPU infrastructure and operations costs.
LLM pricing continues to drop rapidly. Input costs have fallen roughly 10x in the last 18 months, and the gap between frontier and budget models keeps narrowing. The best strategy today is to architect for flexibility — use model routing, monitor your actual token usage, and be ready to swap models as pricing changes.
To keep track of pricing shifts: check our live pricing table, count your tokens, or subscribe to our newsletter for updates when models launch or prices change.