
How LLM Token Pricing Works: A Complete Guide to API Costs in 2026

Learn how LLM API pricing works — from tokens, input/output costs, and reasoning tokens to vision, embedding, and fine-tuning pricing. Includes real cost examples, free tiers, and 6 strategies to cut your AI spend.

Glevd · Published March 26, 2026 · Updated April 9, 2026 · 18 min read


LLM APIs charge per token — typically $0.05 to $75 per million tokens depending on the model. A token is roughly 4 characters or 0.75 words. Here's exactly how pricing works, what drives the cost differences between models, and how to estimate and optimize your spend.

Want to check token counts right now? Try our free LLM token counter — paste any text and see counts across GPT-5, Claude, Gemini, and more.

What is a token?

A token is the basic unit of text that language models process. Rather than reading character-by-character or word-by-word, LLMs use tokenizers that split text into subword pieces.

Most modern LLMs use Byte-Pair Encoding (BPE), which learns common character sequences from training data. The result:

  • Common words like "the" or "and" → 1 token
  • Longer words like "hamburger" → 3 tokens ("ham" + "bur" + "ger")
  • Rare technical terms may become 4-5 tokens

Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English.

Different models use different tokenizers, so the same text produces different token counts. The differences are usually within 5-10% for English text but can be larger for code, non-Latin scripts, or specialized terminology.

You can check exact counts with our LLM token counter, which uses real tokenizers for OpenAI models and calibrated estimates for others.
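
To see tokenization in action, here's a minimal sketch using OpenAI's open-source tiktoken library. It uses the o200k_base encoding shipped with recent OpenAI models; counts from other providers' tokenizers will differ by the 5-10% noted above.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models

for text in ["the", "hamburger", "Byte-Pair Encoding"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```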

Input vs. output token pricing

Every LLM API charges separately for input tokens (your prompt, system message, and any context) and output tokens (the model's response). Output tokens always cost more — typically 4-8x the input price, as the table below shows.

Why? Input tokens can be processed in parallel in a single forward pass through the model. Output tokens require autoregressive generation: the model predicts tokens one at a time, and every single output token requires its own full forward pass.

Here's what this looks like across major models:

| Model | Input $/M | Output $/M | Output/Input ratio | Context window |
|---|---|---|---|---|
| GPT-5 nano | $0.05 | $0.40 | 8x | 128K |
| DeepSeek V3 | $0.27 | $1.10 | 4.1x | 128K |
| Gemini 3.1 Flash | $0.15 | $0.60 | 4x | 1M |
| Gemini 3.1 Pro | $1.25 | $5.00 | 4x | 2M |
| GPT-5.4 | $2.50 | $15.00 | 6x | 256K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x | 200K |
| Claude Opus 4.6 | $15.00 | $75.00 | 5x | 200K |
| GPT-5.4 Pro | $30.00 | $180.00 | 6x | 256K |

For the complete breakdown of every model, see our LLM pricing comparison table.

Key takeaway: Context window size matters for cost. Models with larger windows (Gemini's 1-2M tokens) let you send more context per request, but more context means more input tokens billed. Your prompt design matters more than model choice for cost control — reducing output length (with max_tokens or explicit instructions) often saves more than switching models, because output tokens are the expensive part.

The hidden cost: reasoning tokens

Reasoning models — o3, o4-mini, DeepSeek R1, and similar "chain-of-thought" models — introduce a cost multiplier that surprises many developers.

These models generate internal reasoning tokens (sometimes called "thinking tokens") that are used for intermediate reasoning steps. These tokens are billed as output tokens but may not appear in the visible response. The ratio of reasoning tokens to visible output can be extreme:

  • A simple factual query might produce 50 visible tokens but 500-2,000 reasoning tokens
  • A complex math problem might produce 100 visible tokens but 5,000-15,000 reasoning tokens
  • Cost can be 10-30x what you'd expect from the visible output alone (the sketch below shows how to account for this)
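
Because those hidden tokens are billed, estimate cost from the API's usage accounting rather than from the text you can see. A minimal sketch, assuming a usage payload shaped like OpenAI's (which reports usage.completion_tokens_details.reasoning_tokens; other providers use different field names):

```python
def request_cost_usd(usage: dict, input_price: float, output_price: float) -> float:
    """Cost of one request; prices are $ per million tokens.

    `completion_tokens` already includes the hidden reasoning tokens,
    which is why reasoning models cost more than their answers suggest.
    """
    details = usage.get("completion_tokens_details", {})
    reasoning = details.get("reasoning_tokens", 0)
    visible = usage["completion_tokens"] - reasoning
    cost = (usage["prompt_tokens"] * input_price
            + usage["completion_tokens"] * output_price) / 1e6
    print(f"visible: {visible}, hidden reasoning: {reasoning}, cost: ${cost:.4f}")
    return cost

# A "simple" factual query: 50 visible tokens riding on 2,000 reasoning tokens
request_cost_usd(
    {"prompt_tokens": 200, "completion_tokens": 2_050,
     "completion_tokens_details": {"reasoning_tokens": 2_000}},
    input_price=2.50, output_price=15.00,
)
```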

When to use reasoning models:

  • Complex multi-step problems (math, logic, code debugging)
  • Tasks where accuracy is more important than cost
  • Problems that benefit from "thinking through" steps

When to avoid them:

  • Simple text generation, summarization, classification
  • High-volume production tasks where speed and cost matter
  • Tasks where non-reasoning models already achieve sufficient quality

For a side-by-side comparison of reasoning vs. non-reasoning models, check our reasoning model rankings and non-reasoning rankings.

Beyond text: vision, embedding, and fine-tuning costs

Token pricing for text generation is just one piece of the cost puzzle. If you're building production AI applications, you'll likely encounter these adjacent costs too.

Vision / image input pricing

Multimodal models can process images as input. How they charge varies by provider:

  • Claude and Gemini charge image inputs at the same per-token rate as text. Images are converted to tokens — roughly 1,334 tokens per 1000×1000px image for Claude, ~258 tokens per image for Gemini. This makes vision relatively cheap on these platforms.
  • OpenAI uses dedicated image model variants (like GPT-5 Image) with higher per-token rates — typically 4-8x the text input price.

Bottom line: If vision is a core part of your pipeline, Claude and Gemini are significantly cheaper for image processing. For occasional image analysis, the cost difference is negligible.
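
You can budget image costs before sending anything. A small sketch using Anthropic's published approximation (tokens ≈ width × height / 750); the Gemini number is treated here as a flat per-image constant, per the figures above:

```python
import math

def claude_image_tokens(width_px: int, height_px: int) -> int:
    """Anthropic's documented approximation: tokens ≈ (w × h) / 750."""
    return math.ceil(width_px * height_px / 750)

GEMINI_TOKENS_PER_IMAGE = 258  # flat count per image (or per tile for large images)

tokens = claude_image_tokens(1000, 1000)   # ≈ 1,334 tokens
cost = tokens * 3.00 / 1e6                 # at Claude Sonnet's $3.00/M input rate
print(f"{tokens} tokens ≈ ${cost:.5f} per image")
```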

Embedding model pricing

Embeddings power search, RAG pipelines, and similarity matching. They're dramatically cheaper than text generation — typically $0.02-$0.20 per million tokens:

| Model | $/M tokens |
|---|---|
| OpenAI text-embedding-3-small | $0.02 |
| Voyage AI voyage-4-lite | $0.02 |
| Voyage AI voyage-4 | $0.06 |
| OpenAI text-embedding-3-large | $0.13 |
| Google Gemini Embedding 001 | $0.15 |

Embedding costs are usually a rounding error compared to generation costs. For a RAG application doing 10,000 queries/day with 500 tokens per query, embedding costs would be ~$3/month with text-embedding-3-small (5M tokens/day at $0.02/M is $0.10/day). The generation step is where the real cost lives.
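
Here's that comparison as a quick sketch, reusing the per-query token counts from the RAG example later in this guide as stand-ins for the generation side:

```python
QUERIES_PER_DAY, DAYS = 10_000, 30

# Embedding side: 500 tokens/query with text-embedding-3-small at $0.02/M
embed_monthly = QUERIES_PER_DAY * DAYS * 500 * 0.02 / 1e6                  # ≈ $3.00

# Generation side (for scale): 2,600 input + 400 output tokens/query
# at Claude Sonnet 4.6's $3.00/$15.00 per M
gen_monthly = QUERIES_PER_DAY * DAYS * (2_600 * 3.00 + 400 * 15.00) / 1e6  # ≈ $4,140

print(f"embeddings: ${embed_monthly:.2f}/mo vs generation: ${gen_monthly:,.0f}/mo")
```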

Fine-tuning costs

Fine-tuning lets you customize a model on your own data. It has two cost components:

  1. Training cost: Charged per million tokens in your training dataset × number of epochs. OpenAI charges $3-$25/M tokens depending on the model (GPT-4o mini: $3/M, GPT-4o: $25/M).
  2. Inference cost: Fine-tuned models typically cost 1.5-2x the base model's inference price.

Fine-tuning makes economic sense when it lets you use a smaller, cheaper model to achieve the quality of a larger one — or when you need specialized behavior that no amount of prompting can achieve. For most use cases, prompt engineering and few-shot examples are more cost-effective.
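
As a back-of-the-envelope sketch of the two components (the training price is the GPT-4o mini figure above; dataset size, epochs, base prices, and traffic are hypothetical):

```python
# 1. One-time training: training tokens × epochs × $/M training price
train_tokens, epochs, train_price = 2_000_000, 3, 3.00      # GPT-4o mini: $3/M
training_cost = train_tokens / 1e6 * epochs * train_price   # $18, paid once

# 2. Ongoing inference at a 1.5-2x premium over the base model
base_in, base_out, premium = 0.15, 0.60, 2.0    # hypothetical base model $/M
tokens_in, tokens_out = 50_000_000, 10_000_000  # hypothetical monthly traffic
monthly_inference = (tokens_in * base_in + tokens_out * base_out) * premium / 1e6

print(f"training: ${training_cost:.0f} once, inference: ${monthly_inference:.2f}/month")
```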

How to estimate costs for your use case

The basic formula:

Monthly cost = (avg input tokens × input price / 1M + avg output tokens × output price / 1M) × requests per month
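
The same formula as a small helper (prices are whatever your provider charges per million tokens):

```python
def monthly_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float,
                 requests_per_month: int) -> float:
    """Prices in $ per million tokens; token counts are per-request averages."""
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return per_request * requests_per_month

# Example 1 below: 1,000 input + 300 output tokens, 30,000 requests/month
print(monthly_cost(1_000, 300, in_price=2.50, out_price=15.00,
                   requests_per_month=30_000))   # → 210.0
```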

Example 1: Customer support chatbot

Let's say you're building a chatbot that handles 1,000 conversations per day:

  • Average system prompt + context: 800 tokens (input)
  • Average user message: 200 tokens (input)
  • Average bot response: 300 tokens (output)
  • Conversations per day: 1,000

With GPT-5.4 ($2.50 input / $15.00 output per M tokens):

  • Daily input cost: 1,000 × 1,000 tokens × $2.50 / 1M = $2.50
  • Daily output cost: 1,000 × 300 tokens × $15.00 / 1M = $4.50
  • Monthly: ~$210

With DeepSeek V3 ($0.27 input / $1.10 output per M tokens):

  • Daily input cost: 1,000 × 1,000 × $0.27 / 1M = $0.27
  • Daily output cost: 1,000 × 300 × $1.10 / 1M = $0.33
  • Monthly: ~$18

That's a 12x cost difference for the same workload. Whether the quality difference justifies it depends on your use case.

Example 2: RAG application (document Q&A)

A retrieval-augmented generation app that answers questions over internal docs, handling 5,000 queries per day:

  • System prompt: 500 tokens (input)
  • Retrieved context chunks: 2,000 tokens (input)
  • User question: 100 tokens (input)
  • Answer: 400 tokens (output)
  • Embedding each query: 100 tokens

With Claude Sonnet 4.6 ($3.00 input / $15.00 output per M tokens) + text-embedding-3-small ($0.02/M):

  • Daily input cost: 5,000 × 2,600 × $3.00 / 1M = $39.00
  • Daily output cost: 5,000 × 400 × $15.00 / 1M = $30.00
  • Daily embedding cost: 5,000 × 100 × $0.02 / 1M = $0.01
  • Monthly: ~$2,070 (embeddings are negligible)

With prompt caching on the 500-token system prompt (90% savings on cached portion):

  • Cached input savings: 5,000 × 500 × $3.00 × 0.9 / 1M = $6.75/day saved
  • Monthly with caching: ~$1,868 (10% savings)

The real savings here come from optimizing your retrieval — sending 1,000 tokens of context instead of 2,000 drops the daily input cost from $39 to $24, a much bigger saving than caching alone.

Example 3: Batch content generation

Generating 500 product descriptions per day, each ~200 words output:

  • System prompt + product data: 600 tokens (input)
  • Generated description: 270 tokens (output)
  • Using batch API (50% discount)

With GPT-5.4 batch ($1.25 input / $7.50 output per M tokens):

  • Daily input cost: 500 × 600 × $1.25 / 1M = $0.38
  • Daily output cost: 500 × 270 × $7.50 / 1M = $1.01
  • Monthly: ~$42

With GPT-5 nano batch ($0.025 input / $0.20 output per M tokens):

  • Monthly: ~$1

For structured, repetitive content generation, budget models with batch APIs can be almost free.

For quick estimates without manual math, use our cost calculator.

6 ways to reduce LLM costs

1. Prompt caching

Both Anthropic and OpenAI offer prompt caching, which stores the computed state for repeated prompt prefixes. OpenAI applies it automatically once a prefix recurs; Anthropic requires you to mark cacheable prefixes explicitly. When the same prefix is reused (e.g., a long system prompt), cached input tokens are billed at a steep discount (roughly 10% of the normal price on Anthropic).

If your application uses a consistent system prompt or frequently includes the same context (like documentation or user history), caching can cut input costs by up to 90% on those cached portions.
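
On Anthropic, you opt in by marking the prefix with cache_control. A minimal sketch with the Anthropic Python SDK (the model id is a placeholder; substitute whichever model you use):

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "…several thousand tokens of stable instructions/context…"

response = client.messages.create(
    model="claude-sonnet-4-6",   # placeholder id; use your actual model
    max_tokens=300,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # mark this prefix cacheable
    }],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# usage shows the split: cache reads are billed at the discounted rate
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```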

2. Batch processing

OpenAI and Anthropic offer batch APIs with a 50% discount for non-real-time workloads. If your tasks don't need immediate responses — bulk classification, content generation, data extraction — batch processing halves your costs with no quality difference.
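
With OpenAI, a batch job is a JSONL file of requests plus two API calls. A sketch (the model id is a placeholder):

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. One request per line, each tagged with a custom_id for matching results
with open("requests.jsonl", "w") as f:
    for i, product in enumerate(["desk lamp", "standing desk"]):
        f.write(json.dumps({
            "custom_id": f"desc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-nano",   # placeholder id
                "messages": [{"role": "user",
                              "content": f"Write a product description for a {product}."}],
                "max_tokens": 300,
            },
        }) + "\n")

# 2. Upload the file, then start the batch; it completes within 24h at 50% off
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```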

3. Model routing

Not every request needs a frontier model. A model router sends simple tasks (classification, extraction, formatting) to cheap fast models (GPT-5 nano, Gemini 3.1 Flash-Lite) and complex tasks (reasoning, creative writing, code generation) to frontier models.

This approach can cut overall costs by 60-80% while maintaining quality where it matters. See our best budget LLMs guide for recommended models at each price tier.
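
A router can be as simple as a lookup plus a size check; production systems often use a small classifier model instead. A deliberately naive sketch (model ids are placeholders):

```python
CHEAP_MODEL, FRONTIER_MODEL = "gpt-5-nano", "gpt-5.4"   # placeholder ids

SIMPLE_TASKS = {"classify", "extract", "format"}

def pick_model(task_type: str, prompt: str) -> str:
    """Route by task type and prompt size; everything hard goes to the frontier."""
    if task_type in SIMPLE_TASKS and len(prompt) < 4_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL    # reasoning, creative writing, code generation

print(pick_model("classify", "Positive or negative? 'Arrived broken.'"))  # cheap
print(pick_model("code", "Why does this recursion blow the stack? …"))    # frontier
```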

4. Prompt optimization

Shorter prompts = fewer input tokens = lower cost. Common wins:

  • Remove verbose instructions the model doesn't need
  • Use examples efficiently (1-2 instead of 5-6)
  • Structure context as key-value pairs instead of prose
  • Strip HTML/formatting from context documents before sending

5. Output length limits

Set max_tokens to cap the model's response length. Without it, models may generate unnecessarily verbose responses. For structured output (JSON, classifications), explicit length constraints can reduce output tokens by 50-80%.

6. Open-weight models for high-volume tasks

Models like Llama 4 Maverick, Qwen3.5 397B, and DeepSeek V3 are free to self-host. At high volumes (millions of requests/month), self-hosting can be dramatically cheaper than API pricing — though you need to factor in GPU infrastructure costs ($1-$3/hr per GPU). The break-even point is typically around 50,000+ daily requests for smaller models.
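
A rough break-even sketch (every number here is an assumption, and real throughput per GPU varies enormously by model size and serving stack):

```python
# API side: a typical request at DeepSeek V3 pricing ($0.27 in / $1.10 out per M)
api_cost_per_request = (1_000 * 0.27 + 300 * 1.10) / 1e6    # $0.0006

# Self-host side: one GPU at $2/hr, assumed able to absorb the whole load
gpu_cost_per_day = 2.00 * 24

# Same ballpark as the 50,000+ figure above; cheaper GPUs or bigger
# requests move the break-even point down
break_even = gpu_cost_per_day / api_cost_per_request
print(f"self-hosting pays off above ~{break_even:,.0f} requests/day")  # ~80,000
```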

Free tiers and credits to get started

You don't need to spend money to start experimenting. Most providers offer free access:

| Provider | Free offering | Best for |
|---|---|---|
| Google AI Studio | Free tier with rate limits (15 RPM for Flash, 2 RPM for Pro) | Prototyping with Gemini models |
| DeepSeek | 5M free tokens, no credit card required | Budget-conscious production use |
| OpenAI | $5 credit for new accounts (expires in 3 months) | Testing GPT-5 models |
| Anthropic | $5 credit for new accounts | Testing Claude models |
| Mistral | Free tier via La Plateforme | Experimenting with open-weight alternatives |
| Voyage AI | 200M free embedding tokens | Building RAG pipelines |

Google AI Studio is the most generous for ongoing free usage. DeepSeek is the best option if you need low-cost production access — even after free credits expire, V3 at $0.27/$1.10 per M tokens is among the cheapest APIs available.

Rate limits: the hidden cost constraint

A model might be cheap per token but throttled so heavily that it can't serve your traffic — effectively making it more expensive because you need a fallback.

Rate limits vary dramatically by provider and spend tier:

  • Free tiers are heavily restricted: 2-15 requests per minute, making them suitable only for prototyping
  • Tier 1 (after first payment): typically 50-1,000 RPM, 500K-1M tokens per minute
  • Higher tiers (after $100-$400+ cumulative spend): 2,000-4,000 RPM with millions of TPM

Practical implications:

  • If you need to serve 100 concurrent users, free or Tier 1 limits won't cut it — budget for Tier 2+ spend levels
  • Prompt caching helps: only uncached tokens count toward Anthropic's input TPM limits, so high cache hit rates effectively multiply your throughput
  • Batch APIs have separate, higher limits and don't count against synchronous rate limits

Factor rate limits into your model selection alongside per-token pricing. The cheapest model on paper may not be viable if you can't get enough throughput at your spend level.
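
Whatever tier you're on, wrap API calls in retry logic that respects 429 responses. A provider-agnostic sketch (SDKs surface rate-limit errors differently; this assumes the exception carries a status_code attribute):

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry on HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception as err:
            if getattr(err, "status_code", None) != 429:
                raise                      # real errors propagate immediately
            time.sleep(min(2 ** attempt + random.random(), 60))
    raise RuntimeError("still rate-limited after all retries")
```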

What's next

LLM pricing continues to drop rapidly. Input costs have fallen roughly 10x in the last 18 months, and the gap between frontier and budget models keeps narrowing. Track these shifts on our pricing trends page to see historical price curves across providers.

The best strategy today is to architect for flexibility — use model routing, monitor your actual token usage, and be ready to swap models as pricing changes.

To keep track of pricing shifts: check our live pricing table, count your tokens, estimate your costs, or subscribe to our newsletter for updates when models launch or prices change.

Model pricing changes frequently. We send one email a week with what moved and why.