Which budget LLM should you use in 2026? We rank GPT-5.4 mini, GPT-5.4 nano, MiniMax M2.7, Claude Haiku 4.5, Gemini Flash, DeepSeek, and more by benchmarks and price.
GPT-5.4 mini and nano just landed alongside MiniMax M2.7 — three new budget models in 48 hours. The capability floor keeps rising while prices drop. GPT-5.4 mini brings reasoning-class intelligence to $0.75/M input. MiniMax M2.7 quietly beats it on SWE-bench Pro at less than half the price.
This guide ranks every major LLM under $1.50 per million input tokens by benchmark performance, with pricing breakdowns and use-case recommendations. All scores from the BenchLM.ai leaderboard and pricing page.
There are now more than 15 models priced under $1.50/M input tokens. The quality range is enormous — from GPT-5 nano at $0.05/M input to Gemini 3.1 Pro at $1.25/M scoring 94 overall.
| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| GPT-5 nano | OpenAI | $0.05/$0.40 | 400K | 36 | Reasoning |
| Seed 1.6 Flash | ByteDance | $0.08/$0.30 | 256K | — | — |
| Gemini 3.1 Flash-Lite | Google | $0.10/$0.40 | 1M | — | — |
| Step 3.5 Flash | StepFun | $0.10/$0.30 | 256K | — | — |
| GPT-5.4 nano | OpenAI | $0.20/$1.25 | 400K | 58 | Reasoning |
| Mercury 2 | Inception | $0.25/$0.75 | 128K | — | — |
| DeepSeek V3 | DeepSeek | $0.27/$1.10 | 128K | 49 | Non-Reasoning |
| DeepSeek Coder 2.0 | DeepSeek | $0.27/$1.10 | 128K | 62 | Non-Reasoning |
| MiniMax M2.7 | MiniMax | $0.30/$1.20 | 200K | 60* | Non-Reasoning |
| Grok 3 Mini | xAI | $0.30/$0.50 | 128K | 49* | Non-Reasoning |
*MiniMax M2.7 and Grok 3 Mini still have sparse coverage relative to the best-supported frontier rows, so treat their overall scores as directional rather than definitive.
Models from $0.50 up to the $1.50/M cutoff:

| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| Gemini 3 Flash | Google | $0.50/$3.00 | 1M | 67 | Non-Reasoning |
| Kimi K2.5 | Moonshot | $0.50/$2.80 | 128K | 72 | Non-Reasoning |
| DeepSeek R1 | DeepSeek | $0.55/$2.19 | 128K | 45 | Reasoning |
| GPT-5.4 mini | OpenAI | $0.75/$4.50 | 400K | 73 | Reasoning |
| Claude Haiku 4.5 | Anthropic | $1.00/$5.00 | 200K | 63 | Non-Reasoning |
| GLM-5-Turbo | Zhipu | $1.20/$4.00 | 200K | — | — |
| Gemini 3.1 Pro | Google | $1.25/$5.00 | 1M | 94 | Non-Reasoning |
For reference, the full GPT-5.4 costs $2.50/$15.00 and scores 94 overall. Doubling your input spend from $1.25 (Gemini 3.1 Pro) to $2.50 (GPT-5.4) no longer buys a higher overall score — both sit at 94 — though GPT-5.4 still leads on several individual frontier benchmarks.
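One rough way to compare the scored rows above is points of overall score per dollar of input. This is not a BenchLM.ai metric, just a back-of-envelope sketch using the prices and scores from the tables:

```python
# Overall score and $/M input for the scored models above.
# "Points per dollar" is an illustrative value metric, not an official ranking.
models = {
    "GPT-5 nano":     (36, 0.05),
    "GPT-5.4 nano":   (58, 0.20),
    "MiniMax M2.7":   (60, 0.30),
    "GPT-5.4 mini":   (73, 0.75),
    "Gemini 3.1 Pro": (94, 1.25),
}

for name, (score, price) in sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name:15s} {score / price:7.1f} points per $/M input")
```

Raw score-per-dollar heavily rewards the cheapest rows (GPT-5 nano tops this list), which is exactly why the rest of this guide slices by use case instead.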
GPT-5.4 mini is OpenAI's reasoning model at budget pricing — $0.75/M input, 3.3x cheaper than GPT-5.4. It now scores 73 overall with a 400K context window.
Where mini stands out:

- Agentic work: OSWorld-Verified 72.2 and Terminal-Bench 2.0 60 are the best scores in the budget tier.
- Knowledge and reasoning: GPQA 88 and HLE 41.5 lead every model under $1.50/M input.
- Context: 400K tokens, the largest window here outside Gemini's 1M rows.

Where mini falls short:

- Raw coding: SWE-bench Pro 54.4 trails MiniMax M2.7's 56.22 at 2.5x the input price.
- Multimodal: MMMU-Pro 76.6 sits behind Claude Haiku 4.5 (82) and Gemini 3 Flash (80).
The pitch: GPT-5.4 mini makes sense when you need a reasoning model with agentic capability at budget pricing. For pure knowledge or coding tasks, Gemini 3.1 Pro at $1.25 is stronger across the board.
GPT-5.4 nano costs $0.20/M input — 12.5x cheaper than full GPT-5.4. It now lands in the high-50s on BenchLM's overall score and materially outperforms the older GPT-5 nano budget row, but with a different capability profile.
Key scores:
| Benchmark | GPT-5.4 nano | GPT-5 nano | GPT-5.4 mini |
|---|---|---|---|
| GPQA | 82.8 | 71.2 | 88 |
| HLE | 37.7 | — | 41.5 |
| SWE-bench Pro | 52.4 | 22 | 54.4 |
| Terminal-Bench 2.0 | 46.3 | 38 | 60 |
| OSWorld-Verified | 39 | 30 | 72.2 |
| MMMU-Pro | 66.1 | 58 | 76.6 |
GPT-5.4 nano beats GPT-5 nano on every available benchmark — especially coding (SWE-bench Pro 52.4 vs 22) and knowledge (GPQA 82.8 vs 71.2). The gap is large enough that GPT-5.4 nano effectively replaces GPT-5 nano for anything beyond the cheapest possible classification tasks.
The cost math: At $0.20/M input, nano processes 5 million input tokens per dollar. For a classification pipeline handling 100M tokens/month, GPT-5.4 nano costs $20/month. GPT-5.4 mini would cost $75/month for the same volume. That 3.75x multiplier matters at scale.
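The arithmetic above is worth wiring into whatever cost dashboard you run. A minimal sketch, using the $/M input prices quoted in this guide:

```python
# Monthly input-token cost at the per-million rates quoted above.
RATES = {"GPT-5.4 nano": 0.20, "GPT-5.4 mini": 0.75}  # $/M input

def monthly_cost(tokens_per_month: int, rate_per_million: float) -> float:
    """Dollar cost of a month of input traffic at a $/M-token rate."""
    return tokens_per_month / 1_000_000 * rate_per_million

volume = 100_000_000  # 100M input tokens/month, as in the example above
for model, rate in RATES.items():
    print(f"{model}: ${monthly_cost(volume, rate):.2f}/month")
```

This reproduces the $20 vs $75 figures (the 3.75x multiplier); output tokens are extra and priced separately.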
Where nano makes sense: High-volume tasks where cost dominates — classification, tagging, simple extraction, content filtering. For anything requiring strong reasoning or coding, the step up to mini ($0.75) is worth the extra cost.
MiniMax M2.7 is the surprise of this batch. At $0.30/M input — cheaper than GPT-5.4 mini and only $0.10 above nano — it posts the highest SWE-bench Pro score in the budget tier: 56.22.
| Benchmark | MiniMax M2.7 | GPT-5.4 mini | GPT-5.4 nano | Claude Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Pro | 56.22 | 54.4 | 52.4 | 46 |
| Terminal-Bench 2.0 | 57 | 60 | 46.3 | 53 |
| SWE-Multilingual | 76.5 | — | — | — |
| MLE-Bench-Lite | 66.6 | — | — | — |
| Toolathlon | 46.3 | 42.9 | 35.5 | — |
MiniMax M2.7 beats GPT-5.4 mini on SWE-bench Pro by nearly 2 points while costing 2.5x less on input tokens. On SWE-Multilingual (76.5) and MLE-Bench-Lite (66.6), it shows strong coding breadth that the OpenAI budget models haven't been tested on yet.
The caveat: MiniMax M2.7 still has sparse coverage relative to the best-supported frontier rows. Its 60 overall score is directionally useful, but it still rests on a narrow slice of coding and agentic evidence rather than broad cross-category coverage.
200K context is another differentiator. At $0.30/M input, feeding large codebases into M2.7 is dramatically cheaper than any alternative with comparable SWE-bench scores.
| Model | SWE-bench Pro | LiveCodeBench | Price (in/out) |
|---|---|---|---|
| MiniMax M2.7 | 56.22 | — | $0.30/$1.20 |
| GPT-5.4 mini | 54.4 | — | $0.75/$4.50 |
| GPT-5.4 nano | 52.4 | — | $0.20/$1.25 |
| Claude Haiku 4.5 | 46 | 36 | $1.00/$5.00 |
| Gemini 3 Flash | 44 | 36 | $0.50/$3.00 |
| DeepSeek V3 | — | 37.6 | $0.27/$1.10 |
MiniMax M2.7 leads. For budget coding workloads — code review, bug fixing, refactors — it's the best value option in the tier. GPT-5.4 mini is close behind with the added benefit of being a reasoning model.
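Input price alone understates coding costs, because agentic coding generates plenty of output tokens at the much higher output rate. A hedged sketch of blended per-million cost; the 80/20 input/output split is an illustrative assumption, not a measured figure:

```python
def blended_cost(in_price: float, out_price: float, out_share: float = 0.2) -> float:
    """Blended $/M tokens, assuming out_share of all tokens are output.

    The default 80/20 input/output split is an assumption for illustration,
    not a benchmark or billing figure.
    """
    return in_price * (1 - out_share) + out_price * out_share

# Prices from the coding table above
print(f"MiniMax M2.7:     ${blended_cost(0.30, 1.20):.2f}/M blended")
print(f"GPT-5.4 mini:     ${blended_cost(0.75, 4.50):.2f}/M blended")
print(f"Claude Haiku 4.5: ${blended_cost(1.00, 5.00):.2f}/M blended")
```

Under this assumption the gap widens: M2.7's cheap output tokens make it roughly 3x cheaper blended than GPT-5.4 mini, not just 2.5x cheaper on input.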
| Model | Terminal-Bench 2.0 | OSWorld-Verified | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 60 | 72.2 | $0.75/$4.50 |
| MiniMax M2.7 | 57 | — | $0.30/$1.20 |
| Gemini 3 Flash | 56 | 53 | $0.50/$3.00 |
| GPT-5.4 nano | 46.3 | 39 | $0.20/$1.25 |
| Claude Haiku 4.5 | 41 | 57 | $1.00/$5.00 |
| GPT-5 nano | 38 | 30 | $0.05/$0.40 |
GPT-5.4 mini dominates agentic benchmarks in this tier. OSWorld-Verified 72.2 is a standout — closer to full GPT-5.4 (85) than any other budget model gets to its flagship sibling. If you're building an agent on a budget, mini is the pick.
| Model | GPQA | HLE | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 88 | 41.5 | $0.75/$4.50 |
| GPT-5.4 nano | 82.8 | 37.7 | $0.20/$1.25 |
| GPT-5 nano | 71.2 | — | $0.05/$0.40 |
| Gemini 3 Flash | 69 | 6 | $0.50/$3.00 |
| Claude Haiku 4.5 | 67 | 11 | $1.00/$5.00 |
| DeepSeek V3 | 59.1 | — | $0.27/$1.10 |
GPT-5.4 mini and nano dominate knowledge benchmarks in the budget tier. HLE scores of 41.5 and 37.7 are particularly impressive — Claude Haiku 4.5 scores 11 and Gemini 3 Flash scores 6 on the same benchmark.
| Model | MMMU-Pro | Price (in/out) |
|---|---|---|
| Claude Haiku 4.5 | 82 | $1.00/$5.00 |
| Gemini 3 Flash | 80 | $0.50/$3.00 |
| GPT-5.4 mini | 76.6 | $0.75/$4.50 |
| GPT-5.4 nano | 66.1 | $0.20/$1.25 |
| GPT-5 nano | 58 | $0.05/$0.40 |
Claude Haiku 4.5 and Gemini 3 Flash lead the budget tier on multimodal. MiniMax M2.7 has no MMMU-Pro score — another gap in its benchmark coverage.
- **High-volume classification and tagging** — GPT-5 nano ($0.05/$0.40) or GPT-5.4 nano ($0.20/$1.25). If you're processing millions of tokens daily on simple tasks, nano-tier pricing is hard to argue with. GPT-5.4 nano is substantially better on quality if the 4x price increase fits your budget.
- **Budget coding assistant** — MiniMax M2.7 ($0.30/$1.20). Highest SWE-bench Pro in the tier (56.22) at the second-lowest price. The 200K context window handles large codebases well. The caveat: limited benchmark coverage outside coding, so evaluate on your specific tasks.
- **Budget AI agent** — GPT-5.4 mini ($0.75/$4.50). OSWorld-Verified 72.2 and Terminal-Bench 60 are the best agentic scores in the budget tier by a wide margin. The reasoning capability helps with multi-step agent workflows.
- **Long-context workloads** — Gemini 3 Flash ($0.50/$3.00) with 1M context, or GPT-5.4 mini ($0.75/$4.50) with 400K. If you need 1M tokens of context at the cheapest possible price, Gemini 3 Flash is the only option. Gemini 3.1 Pro ($1.25/$5.00) also offers 1M context with much stronger benchmark scores.
- **Best budget all-rounder** — Gemini 3.1 Pro ($1.25/$5.00). At 94 overall, it still scores higher than every other model in this guide. Full benchmark coverage across all categories, 1M context, and $1.25/M input. If you can afford $1.25 instead of $0.30, this is still the safest choice.
- **Cheapest reasoning model** — GPT-5.4 nano ($0.20/$1.25). The only reasoning model under $0.50/M input with broad benchmark coverage. GPQA 82.8 and HLE 37.7 show real reasoning capability at an ultra-budget price.
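The picks above collapse into a trivial routing table. The model names and prices come from this guide; the router itself is purely illustrative, not an API any provider ships:

```python
# Hypothetical task-type router encoding this guide's recommendations.
ROUTES = {
    "classification": "GPT-5 nano",       # $0.05/$0.40, high volume
    "coding":         "MiniMax M2.7",     # $0.30/$1.20, top SWE-bench Pro
    "agent":          "GPT-5.4 mini",     # $0.75/$4.50, top agentic scores
    "long_context":   "Gemini 3 Flash",   # $0.50/$3.00, 1M context
    "reasoning":      "GPT-5.4 nano",     # $0.20/$1.25, cheapest reasoning
}

def pick_model(task: str) -> str:
    """Route by task type, falling back to the all-rounder pick."""
    return ROUTES.get(task, "Gemini 3.1 Pro")

print(pick_model("coding"))        # MiniMax M2.7
print(pick_model("summarization")) # falls back to Gemini 3.1 Pro
```

A real router would also weigh context length and expected output volume, but even a lookup like this beats defaulting every request to one model.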
MiniMax M2.7's current overall score still does not tell the full story. BenchLM.ai's ranking methodology requires breadth of benchmark coverage to produce a reliable overall score, and M2.7's coding results are far stronger than its thin general-purpose coverage can certify.
On the benchmarks that do exist, M2.7 is competitive with or better than GPT-5.4 mini. SWE-bench Pro 56.22 and Terminal-Bench 57 are strong numbers. But without GPQA, HLE, AIME, MMMU-Pro, or instruction-following scores, it's impossible to rank M2.7 fairly against models with full coverage.
This is a recurring problem in AI benchmarking. As we covered in Are AI Benchmarks Reliable?, benchmark coverage and provenance matter as much as the scores themselves. A model with 10 strong scores and 20 unknowns is a riskier choice than a model with 25 moderate scores.
The practical takeaway: If your workload is coding or agentic tasks, MiniMax M2.7's published scores justify trying it. For general-purpose use, stick with models that have full benchmark coverage until more M2.7 data is available.
Three takeaways from this week's releases:
1. The reasoning gap is closing at the bottom. GPT-5.4 mini and nano bring reasoning-class capability to the budget tier. A year ago, reasoning models started at $2.50/M input. Now you can get HLE 37.7 for $0.20/M input.
2. Chinese models keep punching above on coding. MiniMax M2.7 posting the highest SWE-bench Pro score in the budget tier — above both GPT-5.4 mini and nano — continues the trend of Chinese labs producing strong coding models at aggressive price points.
3. Budget doesn't mean weak anymore. GPT-5.4 mini's OSWorld-Verified 72.2 would have been a frontier-class score 12 months ago. The models that cost $0.30–$0.75/M input today are materially better than the $15/M models of early 2025.
Check the BenchLM.ai leaderboard for the latest scores as more benchmarks roll in for these models. Prices and capabilities shift fast — what's budget today is obsolete tomorrow.