
How LLM Token Pricing Works: A Complete Guide to API Costs in 2026

Learn how LLM API pricing works — from tokens, input/output costs, and reasoning tokens to vision, embedding, and fine-tuning pricing. Includes real cost examples, free tiers, and 6 strategies to cut your AI spend.

Glevd · Published March 26, 2026 · Updated April 9, 2026 · 18 min read


LLM APIs charge per token — typically $0.05 to $75 per million tokens depending on the model. A token is roughly 4 characters or 0.75 words. Here's exactly how pricing works, what drives the cost differences between models, and how to estimate and optimize your spend.

Want to check token counts right now? Try our free LLM token counter — paste any text and see counts across GPT-5, Claude, Gemini, and more.

What is a token?

A token is the basic unit of text that language models process. Rather than reading character-by-character or word-by-word, LLMs use tokenizers that split text into subword pieces.

Most modern LLMs use Byte-Pair Encoding (BPE), which learns common character sequences from training data. The result:

  • Common words like "the" or "and" → 1 token
  • Longer words like "hamburger" → 3 tokens ("ham" + "bur" + "ger")
  • Rare technical terms may become 4-5 tokens

Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.

Different models use different tokenizers, so the same text produces different token counts. The differences are usually within 5-10% for English text but can be larger for code, non-Latin scripts, or specialized terminology.

You can check exact counts with our LLM token counter, which uses real tokenizers for OpenAI models and calibrated estimates for others.
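
To see tokenization in action, here's a minimal sketch using OpenAI's open-source tiktoken library. It uses the o200k_base encoding shipped with recent OpenAI models; counts from other providers' tokenizers will differ by the 5-10% noted above.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models

for text in ["the", "hamburger", "Byte-Pair Encoding"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```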

Input vs. output token pricing

Every LLM API charges separately for input tokens (your prompt, system message, and any context) and output tokens (the model's response). Output tokens always cost more — typically 4-8x the input price, as the table below shows.

Why? Input tokens can be processed in parallel in a single forward pass through the model. Output tokens require autoregressive generation: the model predicts tokens one at a time, and every single output token requires its own full forward pass.

Here's what this looks like across major models:

| Model | Input $/M | Output $/M | Output/Input ratio | Context window |
|---|---|---|---|---|
| GPT-5 nano | $0.05 | $0.40 | 8x | 128K |
| DeepSeek V3 | $0.27 | $1.10 | 4.1x | 128K |
| Gemini 3.1 Flash | $0.15 | $0.60 | 4x | 1M |
| Gemini 3.1 Pro | $1.25 | $5.00 | 4x | 2M |
| GPT-5.4 | $2.50 | $15.00 | 6x | 256K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x | 200K |
| Claude Opus 4.6 | $15.00 | $75.00 | 5x | 200K |
| GPT-5.4 Pro | $30.00 | $180.00 | 6x | 256K |

For the complete breakdown of every model, see our LLM pricing comparison table.

Key takeaway: Context window size matters for cost. Models with larger windows (Gemini's 1-2M tokens) let you send more context per request, but more context means more input tokens billed. Your prompt design matters more than model choice for cost control — reducing output length (with max_tokens or explicit instructions) often saves more than switching models, because output tokens are the expensive part.

The hidden cost: reasoning tokens

Reasoning models — o3, o4-mini, DeepSeek R1, and similar "chain-of-thought" models — introduce a cost multiplier that surprises many developers.

These models generate internal reasoning tokens (sometimes called "thinking tokens") that are used for intermediate reasoning steps. These tokens are billed as output tokens but may not appear in the visible response. The ratio of reasoning tokens to visible output can be extreme:

  • A simple factual query might produce 50 visible tokens but 500-2,000 reasoning tokens
  • A complex math problem might produce 100 visible tokens but 5,000-15,000 reasoning tokens
  • Cost can be 10-30x what you'd expect from the visible output alone (the sketch below shows how to account for this)
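
Because those hidden tokens are billed, estimate cost from the API's usage accounting rather than from the text you can see. A minimal sketch, assuming a usage payload shaped like OpenAI's (which reports usage.completion_tokens_details.reasoning_tokens; other providers use different field names):

```python
def request_cost_usd(usage: dict, input_price: float, output_price: float) -> float:
    """Cost of one request; prices are $ per million tokens.

    `completion_tokens` already includes the hidden reasoning tokens,
    which is why reasoning models cost more than their answers suggest.
    """
    details = usage.get("completion_tokens_details", {})
    reasoning = details.get("reasoning_tokens", 0)
    visible = usage["completion_tokens"] - reasoning
    cost = (usage["prompt_tokens"] * input_price
            + usage["completion_tokens"] * output_price) / 1e6
    print(f"visible: {visible}, hidden reasoning: {reasoning}, cost: ${cost:.4f}")
    return cost

# A "simple" factual query: 50 visible tokens riding on 2,000 reasoning tokens
request_cost_usd(
    {"prompt_tokens": 200, "completion_tokens": 2_050,
     "completion_tokens_details": {"reasoning_tokens": 2_000}},
    input_price=2.50, output_price=15.00,
)
```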

When to use reasoning models:

  • Complex multi-step problems (math, logic, code debugging)
  • Tasks where accuracy is more important than cost
  • Problems that benefit from "thinking through" steps

When to avoid them:

  • Simple text generation, summarization, classification
  • High-volume production tasks where speed and cost matter
  • Tasks where non-reasoning models already achieve sufficient quality

For a side-by-side comparison of reasoning vs. non-reasoning models, check our reasoning model rankings and non-reasoning rankings.

Beyond text: vision, embedding, and fine-tuning costs

Token pricing for text generation is just one piece of the cost puzzle. If you're building production AI applications, you'll likely encounter these adjacent costs too.

Vision / image input pricing

Multimodal models can process images as input. How they charge varies by provider:

  • Claude and Gemini charge image inputs at the same per-token rate as text. Images are converted to tokens — roughly 1,334 tokens per 1000×1000px image for Claude, ~258 tokens per image for Gemini. This makes vision relatively cheap on these platforms.
  • OpenAI uses dedicated image model variants (like GPT-5 Image) with higher per-token rates — typically 4-8x the text input price.

Bottom line: If vision is a core part of your pipeline, Claude and Gemini are significantly cheaper for image processing. For occasional image analysis, the cost difference is negligible.
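
You can budget image costs before sending anything. A small sketch using Anthropic's published approximation (tokens ≈ width × height / 750); the Gemini number is treated here as a flat per-image constant, per the figures above:

```python
import math

def claude_image_tokens(width_px: int, height_px: int) -> int:
    """Anthropic's documented approximation: tokens ≈ (w × h) / 750."""
    return math.ceil(width_px * height_px / 750)

GEMINI_TOKENS_PER_IMAGE = 258  # flat count per image (or per tile for large images)

tokens = claude_image_tokens(1000, 1000)   # ≈ 1,334 tokens
cost = tokens * 3.00 / 1e6                 # at Claude Sonnet's $3.00/M input rate
print(f"{tokens} tokens ≈ ${cost:.5f} per image")
```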

Embedding model pricing

Embeddings power search, RAG pipelines, and similarity matching. They're dramatically cheaper than text generation — typically $0.02-$0.20 per million tokens:

| Model | $/M tokens |
|---|---|
| OpenAI text-embedding-3-small | $0.02 |
| Voyage AI voyage-4-lite | $0.02 |
| Voyage AI voyage-4 | $0.06 |
| OpenAI text-embedding-3-large | $0.13 |
| Google Gemini Embedding 001 | $0.15 |

Embedding costs are usually a rounding error compared to generation costs. For a RAG application doing 10,000 queries/day with 500 tokens per query, embedding costs would be ~$3/month with text-embedding-3-small (5M tokens/day at $0.02/M is $0.10/day). The generation step is where the real cost lives.
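
Here's that comparison as a quick sketch, reusing the per-query token counts from the RAG example later in this guide as stand-ins for the generation side:

```python
QUERIES_PER_DAY, DAYS = 10_000, 30

# Embedding side: 500 tokens/query with text-embedding-3-small at $0.02/M
embed_monthly = QUERIES_PER_DAY * DAYS * 500 * 0.02 / 1e6                  # ≈ $3.00

# Generation side (for scale): 2,600 input + 400 output tokens/query
# at Claude Sonnet 4.6's $3.00/$15.00 per M
gen_monthly = QUERIES_PER_DAY * DAYS * (2_600 * 3.00 + 400 * 15.00) / 1e6  # ≈ $4,140

print(f"embeddings: ${embed_monthly:.2f}/mo vs generation: ${gen_monthly:,.0f}/mo")
```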

Fine-tuning costs

Fine-tuning lets you customize a model on your own data. It has two cost components:

  1. Training cost: Charged per million tokens in your training dataset × number of epochs. OpenAI charges $3-$25/M tokens depending on the model (GPT-4o mini: $3/M, GPT-4o: $25/M).
  2. Inference cost: Fine-tuned models typically cost 1.5-2x the base model's inference price.

Fine-tuning makes economic sense when it lets you use a smaller, cheaper model to achieve the quality of a larger one — or when you need specialized behavior that no amount of prompting can achieve. For most use cases, prompt engineering and few-shot examples are more cost-effective.
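
As a back-of-the-envelope sketch of the two components (the training price is the GPT-4o mini figure above; dataset size, epochs, base prices, and traffic are hypothetical):

```python
# 1. One-time training: training tokens × epochs × $/M training price
train_tokens, epochs, train_price = 2_000_000, 3, 3.00      # GPT-4o mini: $3/M
training_cost = train_tokens / 1e6 * epochs * train_price   # $18, paid once

# 2. Ongoing inference at a 1.5-2x premium over the base model
base_in, base_out, premium = 0.15, 0.60, 2.0    # hypothetical base model $/M
tokens_in, tokens_out = 50_000_000, 10_000_000  # hypothetical monthly traffic
monthly_inference = (tokens_in * base_in + tokens_out * base_out) * premium / 1e6

print(f"training: ${training_cost:.0f} once, inference: ${monthly_inference:.2f}/month")
```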

How to estimate costs for your use case

The basic formula:

Monthly cost = (avg input tokens × input price / 1M + avg output tokens × output price / 1M) × requests per month
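
The same formula as a small helper (prices are whatever your provider charges per million tokens):

```python
def monthly_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float,
                 requests_per_month: int) -> float:
    """Prices in $ per million tokens; token counts are per-request averages."""
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return per_request * requests_per_month

# Example 1 below: 1,000 input + 300 output tokens, 30,000 requests/month
print(monthly_cost(1_000, 300, in_price=2.50, out_price=15.00,
                   requests_per_month=30_000))   # → 210.0
```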

Example 1: Customer support chatbot

Let's say you're building a chatbot that handles 1,000 conversations per day:

  • Average system prompt + context: 800 tokens (input)
  • Average user message: 200 tokens (input)
  • Average bot response: 300 tokens (output)
  • Conversations per day: 1,000

With GPT-5.4 ($2.50 input / $15.00 output per M tokens):

  • Daily input cost: 1,000 × 1,000 tokens × $2.50 / 1M = $2.50
  • Daily output cost: 1,000 × 300 tokens × $15.00 / 1M = $4.50
  • Monthly: ~$210

With DeepSeek V3 ($0.27 input / $1.10 output per M tokens):

  • Daily input cost: 1,000 × 1,000 × $0.27 / 1M = $0.27
  • Daily output cost: 1,000 × 300 × $1.10 / 1M = $0.33
  • Monthly: ~$18

That's a 12x cost difference for the same workload. Whether the quality difference justifies it depends on your use case.

Example 2: RAG application (document Q&A)

A retrieval-augmented generation app that answers questions over internal docs, handling 5,000 queries per day:

  • System prompt: 500 tokens (input)
  • Retrieved context chunks: 2,000 tokens (input)
  • User question: 100 tokens (input)
  • Answer: 400 tokens (output)
  • Embedding each query: 100 tokens

With Claude Sonnet 4.6 ($3.00 input / $15.00 output per M tokens) + text-embedding-3-small ($0.02/M):

  • Daily input cost: 5,000 × 2,600 × $3.00 / 1M = $39.00
  • Daily output cost: 5,000 × 400 × $15.00 / 1M = $30.00
  • Daily embedding cost: 5,000 × 100 × $0.02 / 1M = $0.01
  • Monthly: ~$2,070 (embeddings are negligible)

With prompt caching on the 500-token system prompt (90% savings on cached portion):

  • Cached input savings: 5,000 × 500 × $3.00 × 0.9 / 1M = $6.75/day saved
  • Monthly with caching: ~$1,868 (10% savings)

The real savings here come from optimizing your retrieval — sending 1,000 tokens of context instead of 2,000 drops the daily input cost from $39 to $24, a much bigger saving than caching alone.

Example 3: Batch content generation

Generating 500 product descriptions per day, each ~200 words output:

  • System prompt + product data: 600 tokens (input)
  • Generated description: 270 tokens (output)
  • Using batch API (50% discount)

With GPT-5.4 batch ($1.25 input / $7.50 output per M tokens):

  • Daily input cost: 500 × 600 × $1.25 / 1M = $0.38
  • Daily output cost: 500 × 270 × $7.50 / 1M = $1.01
  • Monthly: ~$42

With GPT-5 nano batch ($0.025 input / $0.20 output per M tokens):

  • Monthly: ~$1

For structured, repetitive content generation, budget models with batch APIs can be almost free.

For quick estimates without manual math, use our cost calculator.

6 ways to reduce LLM costs

1. Prompt caching

Both Anthropic and OpenAI offer prompt caching, which stores the computed state for repeated prompt prefixes. OpenAI applies it automatically once a prefix recurs; Anthropic requires you to mark cacheable prefixes explicitly. When the same prefix is reused (e.g., a long system prompt), cached input tokens are billed at a steep discount (roughly 10% of the normal price on Anthropic).

If your application uses a consistent system prompt or frequently includes the same context (like documentation or user history), caching can cut input costs by up to 90% on those cached portions.
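
On Anthropic, you opt in by marking the prefix with cache_control. A minimal sketch with the Anthropic Python SDK (the model id is a placeholder; substitute whichever model you use):

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "…several thousand tokens of stable instructions/context…"

response = client.messages.create(
    model="claude-sonnet-4-6",   # placeholder id; use your actual model
    max_tokens=300,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # mark this prefix cacheable
    }],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# usage shows the split: cache reads are billed at the discounted rate
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```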

2. Batch processing

OpenAI and Anthropic offer batch APIs with a 50% discount for non-real-time workloads. If your tasks don't need immediate responses — bulk classification, content generation, data extraction — batch processing halves your costs with no quality difference.
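
With OpenAI, a batch job is a JSONL file of requests plus two API calls. A sketch (the model id is a placeholder):

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. One request per line, each tagged with a custom_id for matching results
with open("requests.jsonl", "w") as f:
    for i, product in enumerate(["desk lamp", "standing desk"]):
        f.write(json.dumps({
            "custom_id": f"desc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-nano",   # placeholder id
                "messages": [{"role": "user",
                              "content": f"Write a product description for a {product}."}],
                "max_tokens": 300,
            },
        }) + "\n")

# 2. Upload the file, then start the batch; it completes within 24h at 50% off
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```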

3. Model routing

Not every request needs a frontier model. A model router sends simple tasks (classification, extraction, formatting) to cheap fast models (GPT-5 nano, Gemini 3.1 Flash-Lite) and complex tasks (reasoning, creative writing, code generation) to frontier models.

This approach can cut overall costs by 60-80% while maintaining quality where it matters. See our best budget LLMs guide for recommended models at each price tier.
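
A router can be as simple as a lookup plus a size check; production systems often use a small classifier model instead. A deliberately naive sketch (model ids are placeholders):

```python
CHEAP_MODEL, FRONTIER_MODEL = "gpt-5-nano", "gpt-5.4"   # placeholder ids

SIMPLE_TASKS = {"classify", "extract", "format"}

def pick_model(task_type: str, prompt: str) -> str:
    """Route by task type and prompt size; everything hard goes to the frontier."""
    if task_type in SIMPLE_TASKS and len(prompt) < 4_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL    # reasoning, creative writing, code generation

print(pick_model("classify", "Positive or negative? 'Arrived broken.'"))  # cheap
print(pick_model("code", "Why does this recursion blow the stack? …"))    # frontier
```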

4. Prompt optimization

Shorter prompts = fewer input tokens = lower cost. Common wins:

  • Remove verbose instructions the model doesn't need
  • Use examples efficiently (1-2 instead of 5-6)
  • Structure context as key-value pairs instead of prose
  • Strip HTML/formatting from context documents before sending

5. Output length limits

Set max_tokens to cap the model's response length. Without it, models may generate unnecessarily verbose responses. For structured output (JSON, classifications), explicit length constraints can reduce output tokens by 50-80%.

6. Open-weight models for high-volume tasks

Models like Llama 4 Maverick, Qwen3.5 397B, and DeepSeek V3 are free to self-host. At high volumes (millions of requests/month), self-hosting can be dramatically cheaper than API pricing — though you need to factor in GPU infrastructure costs ($1-$3/hr per GPU). The break-even point is typically around 50,000+ daily requests for smaller models.
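
A rough break-even sketch (every number here is an assumption, and real throughput per GPU varies enormously by model size and serving stack):

```python
# API side: a typical request at DeepSeek V3 pricing ($0.27 in / $1.10 out per M)
api_cost_per_request = (1_000 * 0.27 + 300 * 1.10) / 1e6    # $0.0006

# Self-host side: one GPU at $2/hr, assumed able to absorb the whole load
gpu_cost_per_day = 2.00 * 24

# Same ballpark as the 50,000+ figure above; cheaper GPUs or bigger
# requests move the break-even point down
break_even = gpu_cost_per_day / api_cost_per_request
print(f"self-hosting pays off above ~{break_even:,.0f} requests/day")  # ~80,000
```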

Free tiers and credits to get started

You don't need to spend money to start experimenting. Most providers offer free access:

| Provider | Free offering | Best for |
|---|---|---|
| Google AI Studio | Free tier with rate limits (15 RPM for Flash, 2 RPM for Pro) | Prototyping with Gemini models |
| DeepSeek | 5M free tokens, no credit card required | Budget-conscious production use |
| OpenAI | $5 credit for new accounts (expires in 3 months) | Testing GPT-5 models |
| Anthropic | $5 credit for new accounts | Testing Claude models |
| Mistral | Free tier via La Plateforme | Experimenting with open-weight alternatives |
| Voyage AI | 200M free embedding tokens | Building RAG pipelines |

Google AI Studio is the most generous for ongoing free usage. DeepSeek is the best option if you need low-cost production access — even after free credits expire, V3 at $0.27/$1.10 per M tokens is among the cheapest APIs available.

Rate limits: the hidden cost constraint

A model might be cheap per token but throttled so heavily that it can't serve your traffic — effectively making it more expensive because you need a fallback.

Rate limits vary dramatically by provider and spend tier:

  • Free tiers are heavily restricted: 2-15 requests per minute, making them suitable only for prototyping
  • Tier 1 (after first payment): typically 50-1,000 RPM, 500K-1M tokens per minute
  • Higher tiers (after $100-$400+ cumulative spend): 2,000-4,000 RPM with millions of TPM

Practical implications:

  • If you need to serve 100 concurrent users, free or Tier 1 limits won't cut it — budget for Tier 2+ spend levels
  • Prompt caching helps: only uncached tokens count toward Anthropic's input TPM limits, so high cache hit rates effectively multiply your throughput
  • Batch APIs have separate, higher limits and don't count against synchronous rate limits

Factor rate limits into your model selection alongside per-token pricing. The cheapest model on paper may not be viable if you can't get enough throughput at your spend level.
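
Whatever tier you're on, wrap API calls in retry logic that respects 429 responses. A provider-agnostic sketch (SDKs surface rate-limit errors differently; this assumes the exception carries a status_code attribute):

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry on HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception as err:
            if getattr(err, "status_code", None) != 429:
                raise                      # real errors propagate immediately
            time.sleep(min(2 ** attempt + random.random(), 60))
    raise RuntimeError("still rate-limited after all retries")
```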

What's next

LLM pricing continues to drop rapidly. Input costs have fallen roughly 10x in the last 18 months, and the gap between frontier and budget models keeps narrowing. Track these shifts on our pricing trends page to see historical price curves across providers.

The best strategy today is to architect for flexibility — use model routing, monitor your actual token usage, and be ready to swap models as pricing changes.

To keep track of pricing shifts: check our live pricing table, count your tokens, estimate your costs, or subscribe to our newsletter for updates when models launch or prices change.

Model pricing changes frequently. We send one email a week with what moved and why.