
How LLM Token Pricing Works: A Complete Guide to API Costs

Learn how LLM API pricing works — from tokens and input/output costs to prompt caching, batch discounts, and model routing. Practical tips to cut your AI spend.

Glevd·March 26, 2026·12 min read

LLM APIs charge per token — typically $0.05 to $75 per million tokens depending on the model. A token is roughly 4 characters or 0.75 words. Here's exactly how pricing works, what drives the cost differences between models, and how to estimate and optimize your spend.

Want to check token counts right now? Try our free LLM token counter — paste any text and see counts across GPT-5, Claude, Gemini, and more.

What is a token?

A token is the basic unit of text that language models process. Rather than reading character-by-character or word-by-word, LLMs use tokenizers that split text into subword pieces.

Most modern LLMs use Byte-Pair Encoding (BPE), which learns common character sequences from training data. The result:

  • Common words like "the" or "and" → 1 token
  • Longer words like "hamburger" → 3 tokens ("ham" + "bur" + "ger")
  • Rare technical terms may become 4-5 tokens

Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.

Different models use different tokenizers, so the same text produces different token counts. The differences are usually within 5-10% for English text but can be larger for code, non-Latin scripts, or specialized terminology.

You can check exact counts with our LLM token counter, which uses real tokenizers for OpenAI models and calibrated estimates for others.
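The rule of thumb above can be sketched as a quick estimator. This is only the heuristic: exact counts require the model's real tokenizer, and the function here is illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule.

    Exact counts require the model's own tokenizer; this heuristic is
    typically within 5-10% for English prose, worse for code or
    non-Latin scripts.
    """
    if not text:
        return 0
    # round() rather than int() so short strings like "abc" estimate 1 token, not 0
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```

Useful for quick budget estimates before a request; for billing-accurate counts, use the provider's tokenizer.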

Input vs. output token pricing

Every LLM API charges separately for input tokens (your prompt, system message, and any context) and output tokens (the model's response). Output tokens almost always cost more, typically 4-8x the input price for the models below.

Why? Input tokens are processed in a single forward pass through the model. Output tokens require autoregressive generation: the model must predict each token one at a time, running a full probability calculation across its vocabulary for every single output token.

Here's what this looks like across major models:

Model                Input $/M    Output $/M    Output/Input Ratio
GPT-5 nano           $0.05        $0.40         8x
DeepSeek V3          $0.27        $1.10         4.1x
Gemini 3.1 Pro       $1.25        $5.00         4x
GPT-5.4              $2.50        $15.00        6x
Claude Sonnet 4.6    $3.00        $15.00        5x
Claude Opus 4.6      $15.00       $75.00        5x
GPT-5.4 Pro          $30.00       $180.00       6x

For the complete breakdown of every model, see our LLM pricing comparison table.

The practical implication: your prompt design matters more than model choice for cost control. Reducing output length (with max_tokens or explicit instructions) often saves more than switching models, because output tokens are the expensive part.
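To see why output dominates the bill, here is a minimal sketch that splits a single request's cost into input and output parts, using the GPT-5.4 prices from the table above (the token counts are made up):

```python
def cost_split(input_tokens: int, output_tokens: int,
               in_price: float, out_price: float) -> tuple[float, float]:
    """Per-request (input cost, output cost); prices are $ per million tokens."""
    return input_tokens * in_price / 1e6, output_tokens * out_price / 1e6

# GPT-5.4 from the table: $2.50 in / $15.00 out. A 1,000-token prompt
# with a 600-token reply: the output is fewer tokens but most of the bill.
in_cost, out_cost = cost_split(1_000, 600, 2.50, 15.00)
print(f"input ${in_cost:.4f}, output ${out_cost:.4f}")  # input $0.0025, output $0.0090
```

Here the 600 output tokens cost ~78% of the request despite being 37% of the tokens, which is why trimming responses pays off.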

The hidden cost: reasoning tokens

Reasoning models — o3, o4-mini, DeepSeek R1, and similar "chain-of-thought" models — introduce a cost multiplier that surprises many developers.

These models generate internal reasoning tokens (sometimes called "thinking tokens") to work through intermediate steps before answering. They are billed as output tokens but may not appear in the visible response. The ratio of reasoning tokens to visible output can be extreme:

  • A simple factual query might produce 50 visible tokens but 500-2,000 reasoning tokens
  • A complex math problem might produce 100 visible tokens but 5,000-15,000 reasoning tokens
  • Cost can be 10-30x what you'd expect from the visible output alone
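A quick sketch of how reasoning tokens inflate the bill, using the $15/M output price from the table above; the token counts are hypothetical but within the ranges listed:

```python
def billed_output_cost(visible_tokens: int, reasoning_tokens: int,
                       out_price: float) -> float:
    """Reasoning tokens are billed at the output rate even though
    they never appear in the visible response. Price is $ per million."""
    return (visible_tokens + reasoning_tokens) * out_price / 1e6

# 50 visible tokens plus 1,500 hidden reasoning tokens at $15/M output:
visible_only = billed_output_cost(50, 0, 15.00)      # $0.00075
actual       = billed_output_cost(50, 1_500, 15.00)  # $0.02325
print(f"{actual / visible_only:.0f}x the visible-only cost")  # 31x
```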

When to use reasoning models:

  • Complex multi-step problems (math, logic, code debugging)
  • Tasks where accuracy is more important than cost
  • Problems that benefit from "thinking through" steps

When to avoid them:

  • Simple text generation, summarization, classification
  • High-volume production tasks where speed and cost matter
  • Tasks where non-reasoning models already achieve sufficient quality

For a side-by-side comparison of reasoning vs. non-reasoning models, check our reasoning model rankings and non-reasoning rankings.

How to estimate costs for your use case

The basic formula:

Monthly cost = (avg input tokens × input price / 1M + avg output tokens × output price / 1M) × requests per month

Worked example: customer support chatbot

Let's say you're building a chatbot that handles 1,000 conversations per day:

  • Average system prompt + context: 800 tokens (input)
  • Average user message: 200 tokens (input)
  • Average bot response: 300 tokens (output)
  • Conversations per day: 1,000

With GPT-5.4 ($2.50 input / $15.00 output per M tokens):

  • Daily input cost: 1,000 × 1,000 tokens × $2.50 / 1M = $2.50
  • Daily output cost: 1,000 × 300 tokens × $15.00 / 1M = $4.50
  • Monthly: ~$210

With DeepSeek V3 ($0.27 input / $1.10 output per M tokens):

  • Daily input cost: 1,000 × 1,000 × $0.27 / 1M = $0.27
  • Daily output cost: 1,000 × 300 × $1.10 / 1M = $0.33
  • Monthly: ~$18

That's a 12x cost difference for the same workload. Whether the quality difference justifies it depends on your use case.
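The formula and both worked examples above translate directly to code (the prices and token counts are the ones used in this section):

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float,
                 requests_per_month: int) -> float:
    """Monthly cost formula from above; prices are $ per million tokens."""
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1e6
    return per_request * requests_per_month

requests = 1_000 * 30  # 1,000 conversations/day over a 30-day month

gpt = monthly_cost(1_000, 300, 2.50, 15.00, requests)      # 210.0
deepseek = monthly_cost(1_000, 300, 0.27, 1.10, requests)  # 18.0
print(f"GPT-5.4: ${gpt:.0f}/mo, DeepSeek V3: ${deepseek:.0f}/mo")
```

Swapping in your own averages for the token counts gives a first-order budget before you write any application code.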

For quick estimates without manual math, use our LLM cost calculator or our AI cost calculator, which handles blog posts, docs, and feature development estimates.

6 ways to reduce LLM costs

1. Prompt caching

Both Anthropic and OpenAI offer prompt caching that stores the computed state for repeated prompt prefixes. When the same prefix is reused (e.g., a long system prompt), cached input tokens are billed at a steep discount: roughly 10% of the normal input price on Anthropic, and about half price on OpenAI.

If your application uses a consistent system prompt or frequently includes the same context (like documentation or user history), caching can cut input costs by up to 90% on those cached portions.
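A sketch of the caching math, assuming cache reads bill at ~10% of the base input price (the Anthropic-style figure above; check your provider's actual cached-token rate):

```python
def input_cost_with_cache(cached_tokens: int, fresh_tokens: int,
                          in_price: float,
                          cache_read_fraction: float = 0.10) -> float:
    """Input cost when a prompt prefix is served from cache.

    cache_read_fraction is an assumption (~10% of the base input price);
    providers differ, so substitute your real cached-token rate.
    """
    return (cached_tokens * cache_read_fraction + fresh_tokens) * in_price / 1e6

# 800-token cached system prompt + 200 fresh user tokens at $3.00/M input:
with_cache = input_cost_with_cache(800, 200, 3.00)  # $0.00084
no_cache   = input_cost_with_cache(0, 1_000, 3.00)  # $0.00300
print(f"input savings: {1 - with_cache / no_cache:.0%}")  # 72%
```

The savings scale with how much of each request is the repeated prefix: the longer the shared system prompt relative to the fresh user text, the closer you get to the 90% ceiling.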

2. Batch processing

OpenAI and Anthropic offer batch APIs with a 50% discount for non-real-time workloads. If your tasks don't need immediate responses — bulk classification, content generation, data extraction — batch processing halves your costs with no quality difference.

3. Model routing

Not every request needs a frontier model. A model router sends simple tasks (classification, extraction, formatting) to cheap fast models (GPT-5 nano, Gemini 3.1 Flash-Lite) and complex tasks (reasoning, creative writing, code generation) to frontier models.

This approach can cut overall costs by 60-80% while maintaining quality where it matters. See our best budget LLMs guide for recommended models at each price tier.
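A minimal routing sketch. The model names come from the tiers discussed above; the keyword heuristic is purely illustrative (a production router would use a cheap classifier model or task metadata, not string matching):

```python
CHEAP_MODEL = "gpt-5-nano"   # classification, extraction, formatting
FRONTIER_MODEL = "gpt-5.4"   # reasoning, creative writing, code generation

# Illustrative complexity hints only; real routers learn this decision.
COMPLEX_HINTS = ("debug", "prove", "refactor", "step by step", "write a function")

def route(prompt: str) -> str:
    """Pick a model tier for a prompt based on length and crude hints."""
    text = prompt.lower()
    # Long prompts or prompts with complexity hints go to the frontier tier.
    if len(text) > 2_000 or any(hint in text for hint in COMPLEX_HINTS):
        return FRONTIER_MODEL
    return CHEAP_MODEL

print(route("Classify this support ticket: 'refund not received'"))  # gpt-5-nano
print(route("Debug this stack trace and refactor the parser"))       # gpt-5.4
```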

4. Prompt optimization

Shorter prompts = fewer input tokens = lower cost. Common wins:

  • Remove verbose instructions the model doesn't need
  • Use examples efficiently (1-2 instead of 5-6)
  • Structure context as key-value pairs instead of prose
  • Strip HTML/formatting from context documents before sending

5. Output length limits

Set max_tokens to cap the model's response length. Without it, models may generate unnecessarily verbose responses. For structured output (JSON, classifications), explicit length constraints can reduce output tokens by 50-80%.

6. Open-weight models for high-volume tasks

Models like Llama 4 Maverick, Qwen3.5 397B, and DeepSeek V3 are free to self-host. At high volumes (millions of requests/month), self-hosting can be dramatically cheaper than API pricing — though you need to factor in GPU infrastructure costs.

What's next

LLM pricing continues to drop rapidly. Input costs have fallen roughly 10x in the last 18 months, and the gap between frontier and budget models keeps narrowing. The best strategy today is to architect for flexibility — use model routing, monitor your actual token usage, and be ready to swap models as pricing changes.

To keep track of pricing shifts: check our live pricing table, count your tokens, or subscribe to our newsletter for updates when models launch or prices change.
