Which budget LLM should you use in 2026? We rank GPT-5.4 mini, GPT-5.4 nano, MiniMax M2.7, Claude Haiku 4.5, Gemini Flash, DeepSeek, and more by benchmarks and price.
GPT-5.4 mini and nano just landed alongside MiniMax M2.7 — three new budget models in 48 hours. The capability floor keeps rising while prices drop. GPT-5.4 mini brings reasoning-class intelligence to $0.75/M input. MiniMax M2.7 quietly beats it on SWE-bench Pro at less than half the price.
This guide ranks every major LLM under $1.50 per million input tokens by benchmark performance, with pricing breakdowns and use-case recommendations. All scores from the BenchLM.ai leaderboard and pricing page.
There are now more than 15 models priced under $1.50/M input tokens. The quality range is enormous — from GPT-5 nano at $0.05/M input to Gemini 3.1 Pro at $1.25/M scoring 84 overall.
| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| GPT-5 nano | OpenAI | $0.05/$0.40 | 400K | 40 | Reasoning |
| Seed 1.6 Flash | ByteDance | $0.08/$0.30 | 256K | — | — |
| Gemini 3.1 Flash-Lite | Google | $0.10/$0.40 | 1M | — | — |
| Step 3.5 Flash | StepFun | $0.10/$0.30 | 256K | — | — |
| GPT-5.4 nano | OpenAI | $0.20/$1.25 | 400K | 41 | Reasoning |
| Mercury 2 | Inception | $0.25/$0.75 | 128K | — | — |
| DeepSeek V3 | DeepSeek | $0.27/$1.10 | 128K | 31 | Non-Reasoning |
| DeepSeek Coder 2.0 | DeepSeek | $0.27/$1.10 | 128K | — | — |
| MiniMax M2.7 | MiniMax | $0.30/$1.20 | 200K | 24* | Non-Reasoning |
| Grok 3 Mini | xAI | $0.30/$0.50 | 128K | 21 | Non-Reasoning |
*MiniMax M2.7 overall score based on only 10 benchmarks — low confidence.
| Model | Creator | Input/Output | Context | Overall Score | Type |
|---|---|---|---|---|---|
| Gemini 3 Flash | Google | $0.50/$3.00 | 1M | 62 | Non-Reasoning |
| Kimi K2.5 | Moonshot | $0.50/$2.80 | 128K | — | — |
| DeepSeek R1 | DeepSeek | $0.55/$2.19 | 128K | — | Reasoning |
| GPT-5.4 mini | OpenAI | $0.75/$4.50 | 400K | 49 | Reasoning |
| Claude Haiku 4.5 | Anthropic | $0.80/$4.00 | 200K | 62 | Non-Reasoning |
| GLM-5-Turbo | Zhipu | $1.20/$4.00 | 200K | — | — |
| Gemini 3.1 Pro | Google | $1.25/$5.00 | 1M | 84 | Non-Reasoning |
For reference, the full GPT-5.4 costs $2.50/$15.00 and scores 90 overall. The jump from $1.25 (Gemini 3.1 Pro, score 84) to $2.50 (GPT-5.4, score 90) is where the budget tier ends and the production frontier begins.
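One rough way to frame that boundary is benchmark points per dollar of input price. This is an illustrative heuristic, not part of BenchLM.ai's methodology — the scores and prices are the ones quoted in this guide:

```python
# Points-per-dollar heuristic (illustrative only, not a BenchLM.ai metric).
# Values taken from the tables in this article.
models = {
    "Gemini 3.1 Pro": (84, 1.25),  # (overall score, $/M input)
    "GPT-5.4": (90, 2.50),
}

def score_per_dollar(score: float, input_price: float) -> float:
    """Benchmark points per $1/M of input price (higher = better value)."""
    return score / input_price

for name, (score, price) in models.items():
    print(f"{name}: {score_per_dollar(score, price):.1f} points per input dollar")
```

Gemini 3.1 Pro comes out around 67 points per input dollar versus 36 for GPT-5.4 — the frontier model roughly doubles the cost per point, and what you're paying for is the last six points of capability.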
GPT-5.4 mini is OpenAI's reasoning model at budget pricing — $0.75/M input, 3.3x cheaper than GPT-5.4. It scores 49 overall with a 400K context window.
Where mini stands out:
- Agentic benchmarks: OSWorld-Verified 72.1 and Terminal-Bench 2.0 60 are the best scores in the budget tier.
- Knowledge: GPQA 88 and HLE 41.5 top the budget-tier knowledge comparison.
Where mini falls short:
- Coding: SWE-bench Pro 54.4 trails MiniMax M2.7's 56.22 at 2.5x the input price.
- Multimodal: MMMU-Pro 76.6 sits behind Claude Haiku 4.5 (82) and Gemini 3 Flash (80).
The pitch: GPT-5.4 mini makes sense when you need a reasoning model with agentic capability at budget pricing. For pure knowledge or coding tasks, Gemini 3.1 Pro at $1.25 is stronger across the board.
GPT-5.4 nano costs $0.20/M input — 12.5x cheaper than full GPT-5.4. It scores 41 overall, roughly matching GPT-5 nano (40) but with a different capability profile.
Key scores:
| Benchmark | GPT-5.4 nano | GPT-5 nano | GPT-5.4 mini |
|---|---|---|---|
| GPQA | 82.8 | 71.2 | 88 |
| HLE | 37.7 | — | 41.5 |
| SWE-bench Pro | 52.4 | 22 | 54.4 |
| Terminal-Bench 2.0 | 46.3 | 38 | 60 |
| OSWorld-Verified | 39 | 30 | 72.1 |
| MMMU-Pro | 66.1 | 58 | 76.6 |
GPT-5.4 nano beats GPT-5 nano on every available benchmark — especially coding (SWE-bench Pro 52.4 vs 22) and knowledge (GPQA 82.8 vs 71.2). The gap is large enough that GPT-5.4 nano effectively replaces GPT-5 nano for anything beyond the cheapest possible classification tasks.
The cost math: At $0.20/M input, nano processes 5 million input tokens per dollar. For a classification pipeline handling 100M tokens/month, GPT-5.4 nano costs $20/month. GPT-5.4 mini would cost $75/month for the same volume. That 3.75x multiplier matters at scale.
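The same arithmetic applies to any model in the tables above. A minimal sketch — `monthly_cost` is a hypothetical helper for this article's pricing, not a provider API:

```python
def monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Dollar cost of a monthly input-token volume at $X per 1M tokens."""
    return tokens_per_month / 1_000_000 * price_per_million

# 100M input tokens/month, as in the classification-pipeline example:
volume = 100_000_000
print(monthly_cost(volume, 0.20))  # GPT-5.4 nano: $20/month
print(monthly_cost(volume, 0.75))  # GPT-5.4 mini: $75/month
```

Swap in output pricing and your own input/output split to get a full bill estimate; output tokens are priced separately and usually much higher.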
Where nano makes sense: High-volume tasks where cost dominates — classification, tagging, simple extraction, content filtering. For anything requiring strong reasoning or coding, the step up to mini ($0.75) is worth the extra cost.
MiniMax M2.7 is the surprise of this batch. At $0.30/M input — cheaper than GPT-5.4 mini and only $0.10 more than nano — it posts the highest SWE-bench Pro score in the budget tier: 56.22.
| Benchmark | MiniMax M2.7 | GPT-5.4 mini | GPT-5.4 nano | Claude Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Pro | 56.22 | 54.4 | 52.4 | 46 |
| Terminal-Bench 2.0 | 57 | 60 | 46.3 | 53 |
| SWE-Multilingual | 76.5 | — | — | — |
| MLE-Bench-Lite | 66.6 | — | — | — |
| Toolathlon | 46.3 | 42.9 | 35.5 | — |
MiniMax M2.7 beats GPT-5.4 mini on SWE-bench Pro by nearly 2 points while costing 2.5x less on input tokens. On SWE-Multilingual (76.5) and MLE-Bench-Lite (66.6), it shows strong coding breadth that the OpenAI budget models haven't been tested on yet.
The caveat: MiniMax M2.7 has only 10 benchmark results published. There's no GPQA, no HLE, no MMMU-Pro, no AIME data. The 24 overall score reflects sparse coverage, not necessarily weak performance. For coding and agentic tasks specifically, the data that exists is strong. For everything else — knowledge, math, reasoning, instruction following — we simply don't have enough signal.
200K context is another differentiator. At $0.30/M input, feeding large codebases into M2.7 is dramatically cheaper than any alternative with comparable SWE-bench scores.
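Input price alone actually understates M2.7's advantage for agentic coding, where output tokens dominate less than you might expect but still matter. A blended-cost sketch, assuming a hypothetical 75/25 input/output token mix (the ratio is an assumption — adjust it for your workload):

```python
def blended_cost_per_million(input_price: float, output_price: float,
                             input_frac: float = 0.75) -> float:
    """Effective $/M tokens for a workload mixing input and output tokens.

    input_frac is the share of total tokens that are input; 0.75 is an
    assumed mix for illustration, not a measured workload profile.
    """
    return input_price * input_frac + output_price * (1 - input_frac)

minimax = blended_cost_per_million(0.30, 1.20)  # MiniMax M2.7
mini = blended_cost_per_million(0.75, 4.50)     # GPT-5.4 mini
print(minimax, mini, mini / minimax)
```

At this mix, M2.7's blended rate works out to roughly $0.53/M versus about $1.69/M for GPT-5.4 mini — a wider gap than the 2.5x input-only comparison suggests, because mini's $4.50 output price is 3.75x M2.7's.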
| Model | SWE-bench Pro | LiveCodeBench | Price (in/out) |
|---|---|---|---|
| MiniMax M2.7 | 56.22 | — | $0.30/$1.20 |
| GPT-5.4 mini | 54.4 | — | $0.75/$4.50 |
| GPT-5.4 nano | 52.4 | — | $0.20/$1.25 |
| Claude Haiku 4.5 | 46 | 36 | $0.80/$4.00 |
| Gemini 3 Flash | 44 | 36 | $0.50/$3.00 |
| DeepSeek V3 | — | 37.6 | $0.27/$1.10 |
MiniMax M2.7 leads. For budget coding workloads — code review, bug fixing, refactors — it's the best value option in the tier. GPT-5.4 mini is close behind with the added benefit of being a reasoning model.
| Model | Terminal-Bench 2.0 | OSWorld-Verified | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 60 | 72.1 | $0.75/$4.50 |
| MiniMax M2.7 | 57 | — | $0.30/$1.20 |
| Gemini 3 Flash | 56 | 53 | $0.50/$3.00 |
| Claude Haiku 4.5 | 53 | 57 | $0.80/$4.00 |
| GPT-5.4 nano | 46.3 | 39 | $0.20/$1.25 |
| GPT-5 nano | 38 | 30 | $0.05/$0.40 |
GPT-5.4 mini dominates agentic benchmarks in this tier. OSWorld-Verified 72.1 is a standout — closer to full GPT-5.4 (85) than any other budget model gets to its flagship sibling. If you're building an agent on a budget, mini is the pick.
| Model | GPQA | HLE | Price (in/out) |
|---|---|---|---|
| GPT-5.4 mini | 88 | 41.5 | $0.75/$4.50 |
| GPT-5.4 nano | 82.8 | 37.7 | $0.20/$1.25 |
| GPT-5 nano | 71.2 | — | $0.05/$0.40 |
| Gemini 3 Flash | 69 | 6 | $0.50/$3.00 |
| Claude Haiku 4.5 | 67 | 11 | $0.80/$4.00 |
| DeepSeek V3 | 59.1 | — | $0.27/$1.10 |
GPT-5.4 mini and nano dominate knowledge benchmarks in the budget tier. HLE scores of 41.5 and 37.7 are particularly impressive — Claude Haiku 4.5 scores 11 and Gemini 3 Flash scores 6 on the same benchmark.
| Model | MMMU-Pro | Price (in/out) |
|---|---|---|
| Claude Haiku 4.5 | 82 | $0.80/$4.00 |
| Gemini 3 Flash | 80 | $0.50/$3.00 |
| GPT-5.4 mini | 76.6 | $0.75/$4.50 |
| GPT-5.4 nano | 66.1 | $0.20/$1.25 |
| GPT-5 nano | 58 | $0.05/$0.40 |
Claude Haiku 4.5 and Gemini 3 Flash lead the budget tier on multimodal. MiniMax M2.7 has no MMMU-Pro score — another gap in its benchmark coverage.
High-volume classification and tagging — GPT-5 nano ($0.05/$0.40) or GPT-5.4 nano ($0.20/$1.25). If you're processing millions of tokens daily on simple tasks, nano-tier pricing is hard to argue with. GPT-5.4 nano is substantially better on quality if the 4x price increase fits your budget.
Budget coding assistant — MiniMax M2.7 ($0.30/$1.20). Highest SWE-bench Pro in the tier (56.22) at the second-lowest price. The 200K context window handles large codebases well. The caveat: limited benchmark coverage outside coding, so evaluate on your specific tasks.
Budget AI agent — GPT-5.4 mini ($0.75/$4.50). OSWorld-Verified 72.1 and Terminal-Bench 60 are the best agentic scores in the budget tier by a wide margin. The reasoning capability helps with multi-step agent workflows.
Long-context workloads — Gemini 3 Flash ($0.50/$3.00) with 1M context, or GPT-5.4 mini ($0.75/$4.50) with 400K. If you need 1M tokens of context at the cheapest possible price, Gemini 3 Flash is the only option. Gemini 3.1 Pro ($1.25/$5.00) also offers 1M context with much stronger benchmark scores.
Best budget all-rounder — Gemini 3.1 Pro ($1.25/$5.00). At 84 overall, it scores higher than every other model in this guide. Full benchmark coverage across all categories, 1M context, and $1.25/M input. If you can afford $1.25 instead of $0.30, this is the safest choice.
Cheapest capable reasoning model — GPT-5.4 nano ($0.20/$1.25). GPT-5 nano is nominally a reasoning model at $0.05/M, but GPT-5.4 nano's GPQA 82.8 and HLE 37.7 show real reasoning capability at an ultra-budget price.
MiniMax M2.7's 24 overall score doesn't tell the full story. BenchLM.ai's ranking methodology requires breadth of benchmark coverage to produce a reliable overall score. With only 10 benchmarks — all concentrated in coding and agentic tasks — the 24 reflects data sparsity, not necessarily model quality.
On the benchmarks that do exist, M2.7 is competitive with or better than GPT-5.4 mini. SWE-bench Pro 56.22 and Terminal-Bench 57 are strong numbers. But without GPQA, HLE, AIME, MMMU-Pro, or instruction-following scores, it's impossible to rank M2.7 fairly against models with full coverage.
This is a recurring problem in AI benchmarking. As we covered in Are AI Benchmarks Reliable?, benchmark coverage and provenance matter as much as the scores themselves. A model with 10 strong scores and 20 unknowns is a riskier choice than a model with 25 moderate scores.
The practical takeaway: If your workload is coding or agentic tasks, MiniMax M2.7's published scores justify trying it. For general-purpose use, stick with models that have full benchmark coverage until more M2.7 data is available.
Three takeaways from this week's releases:
1. The reasoning gap is closing at the bottom. GPT-5.4 mini and nano bring reasoning-class capability to the budget tier. A year ago, reasoning models started at $2.50/M input. Now you can get HLE 37.7 for $0.20/M input.
2. Chinese models keep punching above on coding. MiniMax M2.7 posting the highest SWE-bench Pro score in the budget tier — above both GPT-5.4 mini and nano — continues the trend of Chinese labs producing strong coding models at aggressive price points.
3. Budget doesn't mean weak anymore. GPT-5.4 mini's OSWorld-Verified 72.1 would have been a frontier-class score 12 months ago. The models that cost $0.30–$0.75/M input today are materially better than the $15/M models of early 2025.
Check the BenchLM.ai leaderboard for the latest scores as more benchmarks roll in for these models. Prices and capabilities shift fast — what's budget today is obsolete tomorrow.