Which budget LLM should you use in 2026? We rank GPT-5.4 mini, GPT-5.4 nano, MiniMax M2.7, Claude Haiku 4.5, Gemini Flash, DeepSeek, and more by benchmarks and price.
GPT-5.4 mini and nano just landed alongside MiniMax M2.7 — three new budget models in 48 hours. The capability floor keeps rising while prices drop. GPT-5.4 mini brings reasoning-class intelligence to $0.75/M input. MiniMax M2.7 quietly beats it on SWE-bench Pro at less than half the price.
This guide ranks every major LLM under $1.50 per million input tokens by benchmark performance, with pricing breakdowns and use-case recommendations. All scores come from the BenchLM.ai leaderboard and pricing page.
There are now more than 15 models priced under $1.50/M input tokens. The quality range is enormous — from GPT-5 nano at $0.05/M input to Gemini 3.1 Pro at $1.25/M scoring 94 overall.
| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| GPT-5 nano | OpenAI | $0.05/$0.40 | 400K | 36 | Reasoning |
| Seed 1.6 Flash | ByteDance | $0.08/$0.30 | 256K | — | |
| Gemini 3.1 Flash-Lite | Google | $0.10/$0.40 | 1M | — | |
| Step 3.5 Flash | StepFun | $0.10/$0.30 | 256K | — | |
| GPT-5.4 nano | OpenAI | $0.20/$1.25 | 400K | 58 | Reasoning |
| Mercury 2 | Inception | $0.25/$0.75 | 128K | — | |
| DeepSeek V3 | DeepSeek | $0.27/$1.10 | 128K | 49 | Non-Reasoning |
| DeepSeek Coder 2.0 | DeepSeek | $0.27/$1.10 | 128K | 62 | Non-Reasoning |
| MiniMax M2.7 | MiniMax | $0.30/$1.20 | 200K | 60* | Non-Reasoning |
| Grok 3 Mini | xAI | $0.30/$0.50 | 128K | 49* | Non-Reasoning |
*MiniMax M2.7 and Grok 3 Mini still have sparse coverage relative to the best-supported frontier rows, so treat their overall scores as directional rather than definitive.
| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| Gemini 3 Flash | Google | $0.50/$3.00 | 1M | 67 | Non-Reasoning |
| Kimi K2.5 (superseded by K2.6) | Moonshot | $0.50/$2.80 | 256K | 68 | Non-Reasoning |
| DeepSeek R1 | DeepSeek | $0.55/$2.19 | 128K | 45 | Reasoning |
| GPT-5.4 mini | OpenAI | $0.75/$4.50 | 400K | 69 | Reasoning |
| Claude Haiku 4.5 | Anthropic | $1.00/$5.00 | 200K | 63 | Non-Reasoning |
| GLM-5-Turbo | Zhipu | $1.20/$4.00 | 200K | — | |
| Gemini 3.1 Pro | Google | $1.25/$5.00 | 1M | 94 | Non-Reasoning |
For reference, the full GPT-5.4 costs $2.50/$15.00 and scores 88 overall. The jump from $1.25 (Gemini 3.1 Pro, score 94) to $2.50 (GPT-5.4, score 88) now actually favors Gemini on overall score — though GPT-5.4 still holds strong frontier-class individual benchmarks.
GPT-5.4 mini is OpenAI's reasoning model at budget pricing — $0.75/M input, 3.3x cheaper than GPT-5.4. It now scores 69 overall with a 400K context window.
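Where mini stands out: agentic and knowledge benchmarks. Among the budget models compared below, it posts the best Terminal-Bench 2.0 (60) and OSWorld-Verified scores, and the best GPQA (88) and HLE (41.5) scores.
Where mini falls short: multimodal and coding value. Its MMMU-Pro score of 76.6 trails Claude Haiku 4.5 (82) and Gemini 3 Flash (80), and its SWE-bench Pro score of 54.4 sits behind MiniMax M2.7 (56.22), which costs less than half as much per input token.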
The pitch: GPT-5.4 mini makes sense when you need a reasoning model with agentic capability at budget pricing. For pure knowledge or coding tasks, Gemini 3.1 Pro at $1.25 is stronger across the board.
GPT-5.4 nano costs $0.20/M input, 12.5x cheaper than full GPT-5.4. It scores 58 on BenchLM's overall score, well ahead of the older GPT-5 nano (36). Against GPT-5.4 mini, it stays close on knowledge and coding but falls well behind on agentic and multimodal benchmarks.
Key scores:
| Benchmark | GPT-5.4 nano | GPT-5 nano | GPT-5.4 mini |
|---|---|---|---|
| GPQA | 82.8 | 71.2 | 88 |
| HLE | 37.7 | — | 41.5 |
| SWE-bench Pro | 52.4 | 22 | 54.4 |
| Terminal-Bench 2.0 | 46.3 | 38 | 60 |
| OSWorld-Verified | 39 | 30 | 72.1 |
| MMMU-Pro | 66.1 | 58 | 76.6 |
GPT-5.4 nano beats GPT-5 nano on every available benchmark — especially coding (SWE-bench Pro 52.4 vs 22) and knowledge (GPQA 82.8 vs 71.2). The gap is large enough that GPT-5.4 nano effectively replaces GPT-5 nano for anything beyond the cheapest possible classification tasks.
The cost math: At $0.20/M input, nano processes 5 million input tokens per dollar. For a classification pipeline handling 100M tokens/month, GPT-5.4 nano costs $20/month. GPT-5.4 mini would cost $75/month for the same volume. That 3.75x multiplier matters at scale.
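The cost math above can be reproduced with a few lines of Python. This is a minimal sketch: the prices are copied from the tables in this guide, while `monthly_cost` and its parameters are names chosen for illustration. The optional `output_tokens` argument is an addition, since the paragraph above counts input tokens only.

```python
# Prices in USD per million tokens (input, output), copied from the tables above.
PRICES = {
    "GPT-5.4 nano": (0.20, 1.25),
    "GPT-5.4 mini": (0.75, 4.50),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Return the monthly bill in USD for the given token volumes."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

nano = monthly_cost("GPT-5.4 nano", 100_000_000)
mini = monthly_cost("GPT-5.4 mini", 100_000_000)
print(f"nano ${nano:.2f}, mini ${mini:.2f}, ratio {mini / nano:.2f}x")
# prints: nano $20.00, mini $75.00, ratio 3.75x
```

Output tokens widen the gap in absolute terms. With an assumed 10M output tokens per month on top of the 100M input tokens, the same function gives $32.50 for nano and $120.00 for mini.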
Where nano makes sense: High-volume tasks where cost dominates — classification, tagging, simple extraction, content filtering. For anything requiring strong reasoning or coding, the step up to mini ($0.75) is worth the extra cost.
MiniMax M2.7 is the surprise of this batch. At $0.30/M input, less than half the price of GPT-5.4 mini and $0.10/M more than GPT-5.4 nano, it posts the highest SWE-bench Pro score in the budget tier: 56.22.
| Benchmark | MiniMax M2.7 | GPT-5.4 mini | GPT-5.4 nano | Claude Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Pro | 56.22 | 54.4 | 52.4 | 46 |
| Terminal-Bench 2.0 | 57 | 60 | 46.3 | 53 |
| SWE-Multilingual | 76.5 | — | — | — |
| MLE-Bench-Lite | 66.6 | — | — | — |
| Toolathlon | 46.3 | 42.9 | 35.5 | — |
MiniMax M2.7 beats GPT-5.4 mini on SWE-bench Pro by nearly 2 points while costing 2.5x less on input tokens. On SWE-Multilingual (76.5) and MLE-Bench-Lite (66.6), it shows strong coding breadth that the OpenAI budget models haven't been tested on yet.
The caveat: MiniMax M2.7 still has sparse coverage relative to the best-supported frontier rows. Its 60 overall score is directionally useful, but it still rests on a narrow slice of coding and agentic evidence rather than broad cross-category coverage.
The 200K context window helps with large codebases. At $0.30/M input, filling it costs about $0.06 per request, against $0.15 for the same 200K tokens on GPT-5.4 mini, the only budget model within 2 points of M2.7 on SWE-bench Pro.
| Model | SWE-bench Pro | LiveCodeBench | Price (in/out) |
|---|---|---|---|
| MiniMax M2.7 | 56.22 | — | $0.30/$1.20 |
| GPT-5.4 mini | 54.4 | — | $0.75/$4.50 |
| GPT-5.4 nano | 52.4 | — | $0.20/$1.25 |
| Claude Haiku 4.5 | 46 | 36 | $1.00/$5.00 |
| Gemini 3 Flash | 44 | 36 | $0.50/$3.00 |
| DeepSeek V3 | — | 37.6 | $0.27/$1.10 |
MiniMax M2.7 leads. For budget coding workloads — code review, bug fixing, refactors — it's the best value option in the tier. GPT-5.4 mini is close behind with the added benefit of being a reasoning model.
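One way to make "best value" concrete is a Pareto check: a model is on the frontier if no other model is both at least as cheap and at least as strong. The sketch below applies that idea to the SWE-bench Pro column and input prices from the table above. It is an illustration rather than BenchLM.ai's methodology, `pareto_frontier` is a hypothetical helper, and output prices are ignored.

```python
# (model, input price in USD per million tokens, SWE-bench Pro score), from the table above.
# DeepSeek V3 is omitted because it has no SWE-bench Pro score.
MODELS = [
    ("MiniMax M2.7", 0.30, 56.22),
    ("GPT-5.4 mini", 0.75, 54.4),
    ("GPT-5.4 nano", 0.20, 52.4),
    ("Claude Haiku 4.5", 1.00, 46.0),
    ("Gemini 3 Flash", 0.50, 44.0),
]

def pareto_frontier(models):
    """Keep models that no other model matches or beats on both price and score."""
    frontier = []
    for name, price, score in models:
        dominated = any(
            p <= price and s >= score and (p < price or s > score)
            for _, p, s in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(MODELS))
# prints: ['MiniMax M2.7', 'GPT-5.4 nano']
```

On these two axes, only MiniMax M2.7 and GPT-5.4 nano survive: M2.7 is both cheaper and higher-scoring than GPT-5.4 mini, Claude Haiku 4.5, and Gemini 3 Flash, while nano remains the cheaper option at a lower score.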
| Model | Terminal-Bench 2.0 | OSWorld-Verified | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 60 | 72.2 | $0.75/$4.50 |
| MiniMax M2.7 | 57 | — | $0.30/$1.20 |
| Gemini 3 Flash | 56 | 53 | $0.50/$3.00 |
| Claude Haiku 4.5 | 53 | 57 | $1.00/$5.00 |
| GPT-5.4 nano | 46.3 | 39 | $0.20/$1.25 |
| GPT-5 nano | 38 | 30 | $0.05/$0.40 |
GPT-5.4 mini dominates agentic benchmarks in this tier. OSWorld-Verified 72.2 is a standout — closer to full GPT-5.4 (85) than any other budget model gets to its flagship sibling. If you're building an agent on a budget, mini is the pick.
| Model | GPQA | HLE | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 88 | 41.5 | $0.75/$4.50 |
| GPT-5.4 nano | 82.8 | 37.7 | $0.20/$1.25 |
| GPT-5 nano | 71.2 | — | $0.05/$0.40 |
| Gemini 3 Flash | 69 | 6 | $0.50/$3.00 |
| Claude Haiku 4.5 | 67 | 11 | $1.00/$5.00 |
| DeepSeek V3 | 59.1 | — | $0.27/$1.10 |
GPT-5.4 mini and nano dominate knowledge benchmarks in the budget tier. HLE scores of 41.5 and 37.7 are particularly impressive — Claude Haiku 4.5 scores 11 and Gemini 3 Flash scores 6 on the same benchmark.
| Model | MMMU-Pro | Price (in/out) |
|---|---|---|
| Claude Haiku 4.5 | 82 | $1.00/$5.00 |
| Gemini 3 Flash | 80 | $0.50/$3.00 |
| GPT-5.4 mini | 76.6 | $0.75/$4.50 |
| GPT-5.4 nano | 66.1 | $0.20/$1.25 |
| GPT-5 nano | 58 | $0.05/$0.40 |
Claude Haiku 4.5 and Gemini 3 Flash lead the budget tier on multimodal. MiniMax M2.7 has no MMMU-Pro score — another gap in its benchmark coverage.
High-volume classification and tagging: GPT-5 nano ($0.05/$0.40) or GPT-5.4 nano ($0.20/$1.25). If you're processing millions of tokens daily on simple tasks, nano-tier pricing is hard to argue with. GPT-5.4 nano is substantially better on quality if the 4x input price fits your budget.
Budget coding assistant: MiniMax M2.7 ($0.30/$1.20). Highest SWE-bench Pro in the tier (56.22) at 2.5x less per input token than GPT-5.4 mini. The 200K context window handles large codebases well. The caveat: limited benchmark coverage outside coding, so evaluate on your specific tasks.
Budget AI agent: GPT-5.4 mini ($0.75/$4.50). OSWorld-Verified 72.2 and Terminal-Bench 60 are the best agentic scores in the budget tier, and the OSWorld-Verified lead over the next-best model (Claude Haiku 4.5 at 57) is about 15 points. The reasoning capability helps with multi-step agent workflows.
Long-context workloads: Gemini 3 Flash ($0.50/$3.00) with 1M context, or GPT-5.4 mini ($0.75/$4.50) with 400K. Gemini 3 Flash is the cheapest model with a 1M window and an overall score (67); Gemini 3.1 Flash-Lite ($0.10/$0.40) is cheaper still but has no overall score yet. Gemini 3.1 Pro ($1.25/$5.00) also offers 1M context with much stronger benchmark scores.
Best budget all-rounder: Gemini 3.1 Pro ($1.25/$5.00). At 94 overall, it scores higher than every other model in this guide, with full benchmark coverage across all categories and 1M context. If you can afford $1.25 instead of $0.30, this is the safest choice.
Cheapest strong reasoning model: GPT-5.4 nano ($0.20/$1.25). GPT-5 nano ($0.05/$0.40) is the cheaper reasoning model, but it trails by 11.6 points on GPQA (71.2 vs 82.8) and by 30.4 points on SWE-bench Pro (22 vs 52.4). GPQA 82.8 and HLE 37.7 show real reasoning capability at an ultra-budget price.
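The recommendations above amount to filtering on hard constraints and then taking the strongest remaining model. A rough sketch of that selection logic follows, using the rows from the overview tables that have an overall score. `Model` and `pick` are hypothetical names, and ranking by overall score alone is a simplification: the picks above also weigh individual benchmarks such as SWE-bench Pro and OSWorld-Verified.

```python
from typing import NamedTuple, Optional

class Model(NamedTuple):
    name: str
    input_price: float  # USD per million input tokens
    context: int        # context window in tokens
    overall: int        # overall score from the tables in this guide
    reasoning: bool     # True for rows typed "Reasoning"

# Rows with an overall score, copied from the overview tables.
MODELS = [
    Model("GPT-5 nano", 0.05, 400_000, 36, True),
    Model("GPT-5.4 nano", 0.20, 400_000, 58, True),
    Model("DeepSeek V3", 0.27, 128_000, 49, False),
    Model("DeepSeek Coder 2.0", 0.27, 128_000, 62, False),
    Model("MiniMax M2.7", 0.30, 200_000, 60, False),
    Model("Grok 3 Mini", 0.30, 128_000, 49, False),
    Model("Gemini 3 Flash", 0.50, 1_000_000, 67, False),
    Model("Kimi K2.5", 0.50, 256_000, 68, False),
    Model("DeepSeek R1", 0.55, 128_000, 45, True),
    Model("GPT-5.4 mini", 0.75, 400_000, 69, True),
    Model("Claude Haiku 4.5", 1.00, 200_000, 63, False),
    Model("Gemini 3.1 Pro", 1.25, 1_000_000, 94, False),
]

def pick(max_input_price: float, min_context: int = 0,
         reasoning_only: bool = False) -> Optional[str]:
    """Return the highest-scoring model that meets every constraint, or None."""
    candidates = [
        m for m in MODELS
        if m.input_price <= max_input_price
        and m.context >= min_context
        and (m.reasoning or not reasoning_only)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.overall).name

print(pick(0.50, min_context=1_000_000))  # prints: Gemini 3 Flash
print(pick(0.50, reasoning_only=True))    # prints: GPT-5.4 nano
print(pick(1.50))                         # prints: Gemini 3.1 Pro
```

Models without an overall score, such as Gemini 3.1 Flash-Lite, are left out of `MODELS`, so the function never recommends them even when they are cheaper.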
MiniMax M2.7's current overall score still does not tell the full story. BenchLM.ai's ranking methodology requires breadth of benchmark coverage to produce a reliable overall score. MiniMax has a much better coding row than its general-purpose confidence level suggests.
On the benchmarks that do exist, M2.7 is competitive with or better than GPT-5.4 mini. SWE-bench Pro 56.22 and Terminal-Bench 57 are strong numbers. But without GPQA, HLE, AIME, MMMU-Pro, or instruction-following scores, it's impossible to rank M2.7 fairly against models with full coverage.
This is a recurring problem in AI benchmarking. As we covered in Are AI Benchmarks Reliable?, benchmark coverage and provenance matter as much as the scores themselves. A model with 10 strong scores and 20 unknowns is a riskier choice than a model with 25 moderate scores.
The practical takeaway: If your workload is coding or agentic tasks, MiniMax M2.7's published scores justify trying it. For general-purpose use, stick with models that have full benchmark coverage until more M2.7 data is available.
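The coverage gap can be made visible by checking which benchmark categories have at least one score for a model. The sketch below uses only the scores listed in this guide's comparison tables. The category groupings mirror those tables, `missing_categories` is a hypothetical helper, and none of this is BenchLM.ai's actual ranking methodology.

```python
# Benchmark groupings follow the comparison tables in this guide; the
# category names are labels chosen for this sketch.
CATEGORIES = {
    "coding": ["SWE-bench Pro", "LiveCodeBench"],
    "agentic": ["Terminal-Bench 2.0", "OSWorld-Verified"],
    "knowledge": ["GPQA", "HLE"],
    "multimodal": ["MMMU-Pro"],
}

# Scores as listed in this guide; a missing key means no score is shown here.
SCORES = {
    "MiniMax M2.7": {"SWE-bench Pro": 56.22, "Terminal-Bench 2.0": 57},
    "GPT-5.4 mini": {
        "SWE-bench Pro": 54.4, "Terminal-Bench 2.0": 60, "OSWorld-Verified": 72.2,
        "GPQA": 88, "HLE": 41.5, "MMMU-Pro": 76.6,
    },
}

def missing_categories(model: str) -> list:
    """List the categories in which the model has no score at all."""
    have = SCORES[model]
    return [
        category
        for category, benchmarks in CATEGORIES.items()
        if not any(b in have for b in benchmarks)
    ]

for model in SCORES:
    print(model, "->", missing_categories(model) or "full category coverage")
# prints:
# MiniMax M2.7 -> ['knowledge', 'multimodal']
# GPT-5.4 mini -> full category coverage
```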
Three takeaways from this week's releases:
1. The reasoning gap is closing at the bottom. GPT-5.4 mini and nano bring reasoning-class capability to the budget tier. A year ago, reasoning models started at $2.50/M input. Now you can get HLE 37.7 for $0.20/M input.
2. Chinese models keep punching above their weight on coding. MiniMax M2.7 posts the highest SWE-bench Pro score in the budget tier, above both GPT-5.4 mini and nano, continuing the trend of Chinese labs shipping strong coding models at aggressive price points.
3. Budget doesn't mean weak anymore. GPT-5.4 mini's OSWorld-Verified 72.2 would have been a frontier-class score 12 months ago. The models that cost $0.30–$0.75/M input today are materially better than the $15/M models of early 2025.
Check the BenchLM.ai leaderboard for the latest scores as more benchmarks roll in for these models. Prices and capabilities shift fast — what's budget today is obsolete tomorrow.