
Self-Hosting vs API: Break-Even Calculator for Open-Weight LLMs

Compare monthly API costs against self-hosting costs — GPU rental or owned hardware, amortized with electricity and overhead. Pick a model, a config, your volume, and see the exact break-even point.

[Interactive calculator: select a task preset, open-weight model, mode (rent or own), GPU config, and a comparison API model to see monthly and yearly costs for API vs self-hosting, the break-even volume (blended 70/30 input/output; above it, self-hosting saves money), and a cost table at typical daily volumes from 1M to 500M tokens/day.]

When does self-hosting an LLM make sense?

Self-hosting an open-weight LLM trades the per-token pricing of a managed API for a roughly fixed monthly GPU cost. That trade makes sense once your sustained volume is high enough that the fixed cost beats the per-token bill. The exact crossover depends on three things: the API price per million tokens, the GPU config you need to run the model, and your utilization of that hardware.

For modern frontier open-weight models (GLM-5.1, DeepSeek V3, Llama 4 Maverick), you typically need 4–8 datacenter-class GPUs. That puts the monthly floor in the $2,000–$20,000 range depending on GPU class and rental vs own. API pricing for the same models is cheap enough that break-even often lands at tens to hundreds of millions of tokens per day. For smaller models (Gemma 4, Phi-4) that fit on a single consumer GPU, the floor is much lower and break-even arrives sooner.
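The arithmetic behind that crossover is simple. A minimal sketch, using the $2,278/mo self-host figure from the example above and an assumed blended API price (not the tool's actual defaults):

```python
def break_even_tokens_per_day(selfhost_monthly_usd, api_price_per_m_tokens):
    """Daily token volume at which a fixed self-host cost matches the API bill."""
    monthly_tokens = selfhost_monthly_usd / api_price_per_m_tokens * 1_000_000
    return monthly_tokens / 30  # average days per month

# Example: $2,278/mo self-host vs an assumed $0.50 blended price per 1M tokens
volume = break_even_tokens_per_day(2278, 0.50)
print(f"{volume / 1e6:.0f}M tokens/day")  # ~152M tokens/day
```

At cheaper API pricing the break-even volume rises proportionally, which is why frontier models often land in the tens-to-hundreds-of-millions range.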

What drives the break-even point?

Three levers: (1) the API's blended price per million tokens (70% input, 30% output is a reasonable default); (2) your self-host monthly cost, which is mostly GPU rental or amortized hardware plus electricity; and (3) overhead — engineer time, monitoring, egress. The tool exposes all three as inputs so you can test your real numbers.
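The three levers combine into a short calculation. A sketch with placeholder prices and overhead, not the tool's actual defaults:

```python
def blended_price(input_per_m, output_per_m, input_share=0.70):
    """Blended $ per 1M tokens at a 70/30 input/output mix."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def selfhost_monthly(gpu_monthly, overhead_monthly=0.0):
    """Fixed monthly self-host cost: GPUs plus engineer time, monitoring, egress."""
    return gpu_monthly + overhead_monthly

# Hypothetical API at $0.30 in / $1.20 out per 1M tokens
p = blended_price(0.30, 1.20)        # 0.57 $/1M tokens blended
cost = selfhost_monthly(2000, 278)   # 2278 $/mo
break_even_m_per_month = cost / p    # millions of tokens per month
print(round(p, 2), round(break_even_m_per_month))
```

Overhead matters more than it looks: adding a few hundred dollars a month of monitoring and engineer time shifts the break-even volume by the same proportion.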

Frequently asked questions

When does self-hosting an LLM actually save money?

Self-hosting starts beating API pricing once you have sustained high volume — roughly tens of millions of tokens per day for frontier open-weight models. Below that, API costs remain lower than renting dedicated GPUs. This calculator shows the exact break-even for your model and volume.

How much does it cost to self-host an LLM?

Monthly cost ranges from about $300 for a small model on consumer GPU rental to $20,000+ for a frontier model on an 8×H100 node. Cost depends on the GPU class, GPU count, mode (rent or own), and your overhead assumptions.

What GPU do I need to run Llama 4 Maverick?

Llama 4 Maverick has ~400B total parameters with 17B active. At FP8 quantization you need roughly 2×A100 80GB as a minimum; for full throughput, 8×H100 is the recommended config. The calculator auto-selects a sensible default per model.
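A rough way to size the GPU requirement is weights-only VRAM: parameter count times bytes per parameter, plus headroom for KV cache and activations. A sketch with an assumed 20% headroom factor (real serving needs vary with context length and batch size):

```python
def vram_gb(params_b, bits_per_param, headroom=1.2):
    """Approximate VRAM (GB) for model weights plus serving headroom."""
    weight_gb = params_b * bits_per_param / 8  # params in billions -> GB
    return weight_gb * headroom

# ~400B total parameters at FP8 (8 bits per parameter)
print(f"{vram_gb(400, 8):.0f} GB")  # ~480 GB with headroom
```

Note that for MoE models like Maverick, all expert weights must be resident in VRAM even though only 17B parameters are active per token, so the total parameter count, not the active count, drives the GPU requirement.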

Is renting GPUs or buying hardware cheaper?

For sustained 24/7 use over multiple years, buying hardware can be cheaper than renting — the calculator amortizes hardware over 4 years and includes electricity, PUE, and overhead. For burst or short-lived workloads, rental wins.
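The rent-vs-own comparison can be sketched as amortized capex plus electricity scaled by PUE. The hardware price, power draw, and electricity rate below are illustrative assumptions, not the calculator's defaults:

```python
def own_monthly(hardware_usd, watts, usd_per_kwh=0.10, pue=1.3,
                amortize_years=4, overhead_monthly=0.0):
    """Amortized monthly cost of owned hardware: capex + electricity x PUE + overhead."""
    capex = hardware_usd / (amortize_years * 12)
    kwh_per_month = watts / 1000 * 24 * 30
    power = kwh_per_month * usd_per_kwh * pue
    return capex + power + overhead_monthly

# Hypothetical 8-GPU node: $250k hardware, 6 kW draw
print(f"${own_monthly(250_000, 6000):,.0f}/mo")  # ~$5,770/mo under these assumptions
```

Comparing that figure against a rental quote for the same node shows why ownership only wins at sustained 24/7 utilization: the amortized cost accrues whether or not the GPUs are busy.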

Can I self-host GPT-5 or Claude?

No. Proprietary models from OpenAI, Anthropic, Google, and xAI have no open weights. You can only use them via their APIs. Self-hosting applies to open-weight models like Llama, DeepSeek, Qwen, GLM, Kimi, Mistral, and Gemma.
