# Self-Hosting vs API: Break-Even Calculator for Open-Weight LLMs
Compare monthly API costs against self-hosting costs — GPU rental or owned hardware, amortized with electricity and overhead. Pick a model, a config, your volume, and see the exact break-even point.
## When does self-hosting an LLM make sense?
Self-hosting an open-weight LLM trades the per-token pricing of a managed API for a roughly fixed monthly GPU cost. That trade makes sense once your sustained volume is high enough that the fixed cost beats the per-token bill. The exact crossover depends on three things: the API price per million tokens, the GPU config you need to run the model, and your utilization of that hardware.
For modern frontier open-weight models (GLM-5.1, DeepSeek V3, Llama 4 Maverick), you typically need 4–8 datacenter-class GPUs. That puts the monthly floor in the $2,000–$20,000 range depending on GPU class and rental vs own. API pricing for the same models is cheap enough that break-even often lands at tens to hundreds of millions of tokens per day. For smaller models (Gemma 4, Phi-4) that fit on a single consumer GPU, the floor is much lower and break-even arrives sooner.
## What drives the break-even point?
Three levers: (1) the API's blended price per million tokens (a 70% input / 30% output mix is a reasonable default); (2) your self-host monthly cost, which is mostly GPU rental or amortized hardware plus electricity; and (3) overhead: engineering time, monitoring, egress. The tool exposes all three as inputs so you can test your real numbers.
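The three levers reduce to a short calculation. The sketch below uses entirely hypothetical prices ($0.27/M input, $1.10/M output, $2,278/mo self-host); substitute your own numbers.

```python
def blended_price_per_mtok(input_price: float, output_price: float,
                           input_share: float = 0.7) -> float:
    """Blend input/output API prices ($ per 1M tokens); default 70% input."""
    return input_share * input_price + (1 - input_share) * output_price

def break_even_tokens_per_day(selfhost_monthly: float,
                              blended_price: float) -> float:
    """Daily token volume at which a fixed self-host bill equals the API bill."""
    monthly_tokens = selfhost_monthly / blended_price * 1_000_000
    return monthly_tokens / 30  # approximate a month as 30 days

# Hypothetical example figures, not real vendor pricing:
price = blended_price_per_mtok(input_price=0.27, output_price=1.10)
volume = break_even_tokens_per_day(selfhost_monthly=2278, blended_price=price)
print(f"blended ${price:.3f}/Mtok -> break-even ~{volume / 1e6:.0f}M tokens/day")
```

Below the printed volume the API is cheaper; above it, the fixed self-host cost wins. Note the model assumes sustained load: bursty traffic effectively lowers your utilization and pushes the break-even point higher.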
## Frequently asked questions
### When does self-hosting an LLM actually save money?
Self-hosting starts beating API pricing once you have sustained high volume — roughly tens of millions of tokens per day for frontier open-weight models. Below that, API costs remain lower than renting dedicated GPUs. This calculator shows the exact break-even for your model and volume.
### How much does it cost to self-host an LLM?
Monthly cost ranges from about $300 for a small model on consumer GPU rental to $20,000+ for a frontier model on an 8×H100 node. Cost depends on the GPU class, GPU count, mode (rent or own), and your overhead assumptions.
### What GPU do I need to run Llama 4 Maverick?
Llama 4 Maverick has ~400B total parameters with 17B active per token. With aggressive quantization you need roughly 2×A100 80GB or better; at FP8 with full throughput, an 8×H100 node is the recommended config. The calculator auto-selects a sensible default per model.
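The GPU count comes from a simple memory estimate: weight size is parameters times bytes per parameter, plus headroom for KV cache and activations. A rough sketch (the 20% headroom factor is an illustrative assumption, not a vendor spec):

```python
import math

def gpus_needed(total_params_b: float, bits: int,
                gpu_vram_gb: float = 80.0, headroom: float = 1.2) -> int:
    """Minimum GPUs to hold the weights: params x bits/8, plus ~20% headroom
    for KV cache and activations (illustrative, not a tuned figure)."""
    weights_gb = total_params_b * bits / 8  # 1B params at 8-bit ~ 1 GB
    return math.ceil(weights_gb * headroom / gpu_vram_gb)

# A ~400B-parameter model at FP8 on 80 GB GPUs:
print(gpus_needed(400, bits=8))   # ~400 GB of weights, ~480 GB with headroom
print(gpus_needed(400, bits=4))   # 4-bit halves the weight footprint
```

Note this sizes memory capacity only; hitting "full throughput" typically takes more GPUs than the bare minimum, which is why recommended configs run larger.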
### Is renting GPUs or buying hardware cheaper?
For sustained 24/7 use over multiple years, buying hardware can be cheaper than renting — the calculator amortizes hardware over 4 years and includes electricity, PUE, and overhead. For burst or short-lived workloads, rental wins.
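The own-vs-rent comparison is the same amortization the calculator performs. A minimal sketch, with every figure hypothetical (8 GPUs at $30k each, 700 W per GPU, PUE 1.4, $0.12/kWh, flat $1,000/mo overhead):

```python
def owned_monthly_cost(gpu_count: int = 8, gpu_price: float = 30_000.0,
                       amort_years: int = 4, watts_per_gpu: float = 700.0,
                       pue: float = 1.4, price_per_kwh: float = 0.12,
                       overhead_monthly: float = 1_000.0) -> float:
    """Monthly cost of owned hardware: straight-line amortization plus
    electricity (scaled by PUE) plus a flat overhead line item."""
    hardware = gpu_count * gpu_price / (amort_years * 12)
    kwh_per_month = gpu_count * watts_per_gpu * pue * 24 * 30 / 1000
    electricity = kwh_per_month * price_per_kwh
    return hardware + electricity + overhead_monthly

print(f"${owned_monthly_cost():,.0f}/mo")
```

Compare that figure against a rental quote for the same node: if the rental price per month is lower, or your workload won't run near 24/7 for the full amortization window, renting wins.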
### Can I self-host GPT-5 or Claude?
No. Proprietary models from OpenAI, Anthropic, Google, and xAI have no open weights. You can only use them via their APIs. Self-hosting applies to open-weight models like Llama, DeepSeek, Qwen, GLM, Kimi, Mistral, and Gemma.