Learn how LLM API pricing works — from tokens, input/output costs, and reasoning tokens to vision, embedding, and fine-tuning pricing. Includes real cost examples, free tiers, and 6 strategies to cut your AI spend.
LLM APIs charge per token — typically $0.05 to $75 per million tokens depending on the model. A token is roughly 4 characters or 0.75 words. Here's exactly how pricing works, what drives the cost differences between models, and how to estimate and optimize your spend.
Want to check token counts right now? Try our free LLM token counter — paste any text and see counts across GPT-5, Claude, Gemini, and more.
A token is the basic unit of text that language models process. Rather than reading character-by-character or word-by-word, LLMs use tokenizers that split text into subword pieces.
Most modern LLMs use Byte-Pair Encoding (BPE), which learns common character sequences from training data. The result is that frequent English words usually map to a single token, while rarer words, code identifiers, and non-Latin text split into several pieces.
Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.
Different models use different tokenizers, so the same text produces different token counts. The differences are usually within 5-10% for English text but can be larger for code, non-Latin scripts, or specialized terminology.
You can check exact counts with our LLM token counter, which uses real tokenizers for OpenAI models and calibrated estimates for others.
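If you'd rather check locally, here's a minimal sketch using the open-source tiktoken library (it covers OpenAI tokenizers; other providers' tokenizers will differ slightly, as noted above):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by recent OpenAI models
text = "LLM APIs charge per token, not per word or character."
tokens = enc.encode(text)
print(f"{len(tokens)} tokens for {len(text)} characters")
# Expect roughly one token per 4 characters for plain English prose.
```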
Every LLM API charges separately for input tokens (your prompt, system message, and any context) and output tokens (the model's response). Output tokens always cost more — typically 2-5x the input price.
Why? Input tokens are processed in a single forward pass through the model. Output tokens require autoregressive generation: the model must predict each token one at a time, running a full probability calculation across its vocabulary for every single output token.
Here's what this looks like across major models:
| Model | Input $/M | Output $/M | Output/Input Ratio | Context Window |
|---|---|---|---|---|
| GPT-5 nano | $0.05 | $0.40 | 8x | 128K |
| DeepSeek V3 | $0.27 | $1.10 | 4.1x | 128K |
| Gemini 3.1 Flash | $0.15 | $0.60 | 4x | 1M |
| Gemini 3.1 Pro | $1.25 | $5.00 | 4x | 2M |
| GPT-5.4 | $2.50 | $15.00 | 6x | 256K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x | 200K |
| Claude Opus 4.6 | $15.00 | $75.00 | 5x | 200K |
| GPT-5.4 Pro | $30.00 | $180.00 | 6x | 256K |
For the complete breakdown of every model, see our LLM pricing comparison table.
Key takeaway: Context window size matters for cost. Models with larger windows (Gemini's 1-2M tokens) let you send more context per request, but more context means more input tokens billed. Your prompt design matters more than model choice for cost control — reducing output length (with max_tokens or explicit instructions) often saves more than switching models, because output tokens are the expensive part.
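Capping output is a one-parameter change in most SDKs. A minimal sketch with the OpenAI Python client (the model ID is a placeholder; substitute whatever you actually use):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-5-nano",  # placeholder model ID
    messages=[
        {"role": "system", "content": "Answer in at most two sentences."},
        {"role": "user", "content": "Why do output tokens cost more than input tokens?"},
    ],
    max_tokens=80,  # hard cap on billable output tokens
)
print(response.usage)  # prompt_tokens, completion_tokens, total_tokens
```

Note the belt-and-suspenders approach: the system prompt asks for brevity, and `max_tokens` enforces a hard ceiling on what you can be billed for.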
Reasoning models — o3, o4-mini, DeepSeek R1, and similar "chain-of-thought" models — introduce a cost multiplier that surprises many developers.
These models generate internal reasoning tokens (sometimes called "thinking tokens") for their intermediate reasoning steps. These tokens are billed as output tokens but may not appear in the visible response, and the ratio of reasoning tokens to visible output can be extreme: a short answer may be preceded by thousands of billed thinking tokens.
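To see why this matters for billing, here's a back-of-envelope sketch; the 10x reasoning multiplier is a hypothetical figure for illustration, not a measured one:

```python
visible_output_tokens = 300        # what you actually see in the response
reasoning_multiplier = 10          # hypothetical: 10 hidden thinking tokens per visible token
output_price_per_m = 15.00         # $/M output tokens, GPT-5.4-class pricing from the table above

billed_output = visible_output_tokens * (1 + reasoning_multiplier)
cost = billed_output * output_price_per_m / 1_000_000
print(f"{billed_output:,} billed output tokens -> ${cost:.4f} per request")
# 3,300 billed output tokens -> $0.0495 per request, 11x what the visible text suggests
```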
When to use reasoning models: multi-step math, complex planning, and hard debugging or code-generation problems, where the extra deliberation measurably improves accuracy.
When to avoid them: simple classification, extraction, and formatting, or any high-volume, latency-sensitive workload, where the hidden tokens multiply cost without improving results.
For a side-by-side comparison of reasoning vs. non-reasoning models, check our reasoning model rankings and non-reasoning rankings.
Token pricing for text generation is just one piece of the cost puzzle. If you're building production AI applications, you'll likely encounter these adjacent costs too.
Multimodal models can process images as input, typically by converting each image into a token count based on its dimensions and detail level, and how they charge varies by provider.
Bottom line: If vision is a core part of your pipeline, Claude and Gemini are significantly cheaper for image processing. For occasional image analysis, the cost difference is negligible.
Embeddings power search, RAG pipelines, and similarity matching. They're dramatically cheaper than text generation — typically $0.02-$0.20 per million tokens:
| Model | $/M tokens |
|---|---|
| OpenAI text-embedding-3-small | $0.02 |
| Voyage AI voyage-4-lite | $0.02 |
| Voyage AI voyage-4 | $0.06 |
| OpenAI text-embedding-3-large | $0.13 |
| Google Gemini Embedding 001 | $0.15 |
Embedding costs are usually a rounding error compared to generation costs. For a RAG application doing 10,000 queries/day with 500 tokens per query, embedding costs would be ~$3/month with text-embedding-3-small. The generation step is where the real cost lives.
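The arithmetic, using the numbers above:

```python
queries_per_day = 10_000
tokens_per_query = 500
price_per_m = 0.02  # text-embedding-3-small, $ per million tokens

monthly_tokens = queries_per_day * tokens_per_query * 30  # 150,000,000 tokens
print(f"${monthly_tokens / 1_000_000 * price_per_m:.2f}/month")  # $3.00/month
```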
Fine-tuning lets you customize a model on your own data. It has two cost components: a one-time training charge billed per token of training data, and higher per-token rates for inference on the tuned model.
Fine-tuning makes economic sense when it lets you use a smaller, cheaper model to achieve the quality of a larger one — or when you need specialized behavior that no amount of prompting can achieve. For most use cases, prompt engineering and few-shot examples are more cost-effective.
The basic formula:
Monthly cost = (avg input tokens × input price / 1M + avg output tokens × output price / 1M) × requests per month
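Translated into code (a minimal sketch; prices are dollars per million tokens, and the chatbot token mix used below is an illustrative assumption):

```python
def monthly_cost(input_tokens, output_tokens, input_price, output_price, requests_per_month):
    """Average token counts per request; prices in $ per million tokens."""
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_month

# Hypothetical chatbot: 1,000 input + 500 output tokens per conversation,
# 1,000 conversations/day (~30,000/month).
print(f"GPT-5.4:     ${monthly_cost(1_000, 500, 2.50, 15.00, 30_000):.2f}")  # $300.00
print(f"DeepSeek V3: ${monthly_cost(1_000, 500, 0.27, 1.10, 30_000):.2f}")   # $24.60
```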
Let's say you're building a chatbot that handles 1,000 conversations per day, about 30,000 per month. Using the sketch above with an illustrative mix of 1,000 input and 500 output tokens per conversation, GPT-5.4 ($2.50 input / $15.00 output per M tokens) lands near $300/month, while DeepSeek V3 ($0.27 input / $1.10 output per M tokens) comes to about $25/month. That's roughly a 12x cost difference for the same workload. Whether the quality difference justifies it depends on your use case.
A retrieval-augmented generation app that answers questions over internal docs, handling 5,000 queries per day, pays for three things per query: embedding the question (negligible at text-embedding-3-small's $0.02/M), the input tokens for the system prompt plus retrieved context, and the generated answer. With Claude Sonnet 4.6 ($3.00 input / $15.00 output per M tokens), the retrieved context dominates the input bill. Prompt caching on the 500-token system prompt (90% savings on the cached portion) trims the fixed overhead, but the real savings here come from optimizing your retrieval: sending 1,000 tokens of context instead of 2,000 cuts the input cost nearly in half, as the sketch below shows.
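Here's that lever in code, with hypothetical per-query token counts (500-token cached system prompt, 100-token question) and Claude Sonnet 4.6's input rate:

```python
def input_cost(system_tokens, context_tokens, question_tokens,
               price_per_m=3.00, cached_system=True):
    # Cached prefix bills at ~10% of the normal input rate.
    effective_system = system_tokens * (0.1 if cached_system else 1.0)
    return (effective_system + context_tokens + question_tokens) * price_per_m / 1_000_000

fat = input_cost(500, 2_000, 100)   # 2,000 tokens of retrieved context
lean = input_cost(500, 1_000, 100)  # 1,000 tokens of retrieved context
print(f"${fat:.5f} vs ${lean:.5f} per query")
print(f"monthly at 5,000 queries/day: ${fat * 5_000 * 30:.0f} vs ${lean * 5_000 * 30:.0f}")
# $0.00645 vs $0.00345 per query; $968 vs $518/month
```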
Generating 500 product descriptions per day, each ~200 words (~270 tokens) of output, adds up to roughly 4M output tokens per month. With GPT-5.4's batch pricing ($1.25 input / $7.50 output per M tokens), the output alone runs about $30/month; with GPT-5 nano batch ($0.025 input / $0.20 output per M tokens), it drops to under a dollar. For structured, repetitive content generation, budget models with batch APIs can be almost free.
For quick estimates without manual math, use our cost calculator.
Both OpenAI and Anthropic offer prompt caching that stores the computed state for repeated prompt prefixes; OpenAI applies it automatically, while Anthropic has you mark cacheable prefixes explicitly. When the same prefix is reused (e.g., a long system prompt), cached input tokens cost as little as ~10% of the normal price.
If your application uses a consistent system prompt or frequently includes the same context (like documentation or user history), caching can cut input costs by up to 90% on those cached portions.
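On Anthropic, you opt in by marking the stable prefix; a minimal sketch with the Anthropic Python SDK (the model ID is a placeholder):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # long, stable prefix

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model ID
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens show the effect
```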
OpenAI and Anthropic offer batch APIs with a 50% discount for non-real-time workloads. If your tasks don't need immediate responses — bulk classification, content generation, data extraction — batch processing halves your costs with no quality difference.
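A sketch of the submission flow with the OpenAI Batch API: write one JSONL line per request, upload the file, and create the batch (the model ID is a placeholder; results arrive asynchronously, within 24 hours):

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request, each with a custom_id for matching results later.
with open("requests.jsonl", "w") as f:
    for i, product in enumerate(["red enamel kettle", "walnut desk lamp"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-nano",  # placeholder model ID
                "messages": [{"role": "user",
                              "content": f"Write a 100-word product description for: {product}"}],
                "max_tokens": 200,
            },
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until completed
```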
Not every request needs a frontier model. A model router sends simple tasks (classification, extraction, formatting) to cheap fast models (GPT-5 nano, Gemini 3.1 Flash-Lite) and complex tasks (reasoning, creative writing, code generation) to frontier models.
This approach can cut overall costs by 60-80% while maintaining quality where it matters. See our best budget LLMs guide for recommended models at each price tier.
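The simplest version is just a lookup; a sketch where the task labels and model IDs are illustrative assumptions, not a library API:

```python
CHEAP_MODEL = "gpt-5-nano"            # classification, extraction, formatting
FRONTIER_MODEL = "claude-sonnet-4-6"  # reasoning, creative writing, code generation

SIMPLE_TASKS = {"classify", "extract", "format"}

def pick_model(task_type: str) -> str:
    """Route simple tasks to the cheap model, everything else to the frontier model."""
    return CHEAP_MODEL if task_type in SIMPLE_TASKS else FRONTIER_MODEL

assert pick_model("classify") == CHEAP_MODEL
assert pick_model("codegen") == FRONTIER_MODEL
```

In production you'd typically replace the static lookup with a small classifier or a confidence threshold, but the billing logic stays the same.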
Shorter prompts = fewer input tokens = lower cost. Common wins include trimming boilerplate instructions, deduplicating few-shot examples, summarizing long documents instead of pasting them verbatim, and stripping markup the model doesn't need.
Set max_tokens to cap the model's response length. Without it, models may generate unnecessarily verbose responses. For structured output (JSON, classifications), explicit length constraints can reduce output tokens by 50-80%.
Models like Llama 4 Maverick, Qwen3.5 397B, and DeepSeek V3 are free to self-host. At high volumes (millions of requests/month), self-hosting can be dramatically cheaper than API pricing — though you need to factor in GPU infrastructure costs ($1-$3/hr per GPU). The break-even point is typically around 50,000+ daily requests for smaller models.
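A back-of-envelope break-even check using the figures above (the per-request token mix is a hypothetical assumption):

```python
gpu_cost_per_hour = 2.00                    # midpoint of the $1-$3/hr range
gpu_monthly = gpu_cost_per_hour * 24 * 30   # ~$1,440/month for one always-on GPU

api_input, api_output = 0.27, 1.10          # DeepSeek V3, $/M tokens
tokens_in, tokens_out = 1_000, 500          # hypothetical per-request mix
api_cost_per_request = (tokens_in * api_input + tokens_out * api_output) / 1_000_000

breakeven_per_day = gpu_monthly / api_cost_per_request / 30
print(f"break-even: ~{breakeven_per_day:,.0f} requests/day")  # ~58,537 requests/day
```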
You don't need to spend money to start experimenting. Most providers offer free access:
| Provider | Free offering | Best for |
|---|---|---|
| Google AI Studio | Free tier with rate limits (15 RPM for Flash, 2 RPM for Pro) | Prototyping with Gemini models |
| DeepSeek | 5M free tokens, no credit card required | Budget-conscious production use |
| OpenAI | $5 credit for new accounts (expires in 3 months) | Testing GPT-5 models |
| Anthropic | $5 credit for new accounts | Testing Claude models |
| Mistral | Free tier via La Plateforme | Experimenting with open-weight alternatives |
| Voyage AI | 200M free embedding tokens | Building RAG pipelines |
Google AI Studio is the most generous for ongoing free usage. DeepSeek is the best option if you need low-cost production access — even after free credits expire, V3 at $0.27/$1.10 per M tokens is among the cheapest APIs available.
A model might be cheap per token but throttled so heavily that it can't serve your traffic — effectively making it more expensive because you need a fallback.
Rate limits vary dramatically by provider and spend tier: new accounts often start at a few requests per minute, and higher throughput typically unlocks only as your cumulative spend grows. The practical implication is to load-test at your actual tier before committing to a provider, and to keep a fallback model configured for traffic spikes.
Factor rate limits into your model selection alongside per-token pricing. The cheapest model on paper may not be viable if you can't get enough throughput at your spend level.
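One way to keep a cheap-but-throttled model viable is exponential backoff with a fallback; a sketch where `call_primary` and `call_fallback` are hypothetical stand-ins for your actual SDK calls:

```python
import time

def complete_with_fallback(prompt, call_primary, call_fallback, max_retries=3):
    """Try the primary model with exponential backoff; fail over if still throttled."""
    for attempt in range(max_retries):
        try:
            return call_primary(prompt)
        except Exception:  # in practice, catch your SDK's specific rate-limit error
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
    return call_fallback(prompt)  # cheaper or looser-limit model as a last resort
```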
LLM pricing continues to drop rapidly. Input costs have fallen roughly 10x in the last 18 months, and the gap between frontier and budget models keeps narrowing. Track these shifts on our pricing trends page to see historical price curves across providers.
The best strategy today is to architect for flexibility — use model routing, monitor your actual token usage, and be ready to swap models as pricing changes.
To keep track of pricing shifts: check our live pricing table, count your tokens, estimate your costs, or subscribe to our newsletter for updates when models launch or prices change.