Find the most cost-effective AI model. Each dot is an LLM plotted by its benchmark score (higher is better) against output token price (lower is better). Models on the efficiency frontier offer the best value at their price point.
[Chart highlights: Gemini 3.1 Flash-Lite — Score/$ 140.0 at $0.40/1M out; GPT-5.4 Pro — score 92 at $180.00/1M out; Gemini 3.1 Flash-Lite — score 56 at $0.40/1M out]
Ranked by Score/$ ratio (benchmark score per dollar of output token cost)
| # | Model | Provider | Score | Output $/1M | Score/$ |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Flash-Lite | Google | 56 | $0.40 | 140.0 |
| 2 | GPT-4.1 nano | OpenAI | 44 | $0.40 | 110.0 |
| 3 | GPT-4o mini | OpenAI | 54 | $0.60 | 90.0 |
| 4 | Gemini 2.5 Flash | Google | 50 | $0.60 | 83.3 |
| 5 | DeepSeek Coder 2.0 | DeepSeek | 62 | $1.10 | 56.4 |
| 6 | GPT-5.4 nano | OpenAI | 58 | $1.25 | 46.4 |
| 7 | DeepSeek V3 | DeepSeek | 49 | $1.10 | 44.5 |
| 8 | GPT-4.1 mini | OpenAI | 57 | $1.60 | 35.6 |
| 9 | Kimi K2.5 | Moonshot AI | 72 | $2.80 | 25.7 |
| 10 | Gemini 3 Flash | Google | 67 | $3.00 | 22.3 |
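The ranking above can be reproduced with a few lines of Python. This is an illustrative sketch, not the site's actual code; the model list below is a subset of the table's data.

```python
# Subset of the table: (model, benchmark score, output price per 1M tokens).
models = [
    ("Gemini 3.1 Flash-Lite", 56, 0.40),
    ("GPT-4.1 nano", 44, 0.40),
    ("GPT-4o mini", 54, 0.60),
    ("Gemini 2.5 Flash", 50, 0.60),
]

# Score/$ = benchmark score divided by output price per 1M tokens,
# then sort descending so the best-value model comes first.
ranked = sorted(
    ((name, score, price, round(score / price, 1)) for name, score, price in models),
    key=lambda row: row[3],
    reverse=True,
)

for rank, (name, score, price, ratio) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {ratio} points per $")
```

Sorting on the precomputed ratio keeps the tie-breaking and rounding in one place, matching the one-decimal Score/$ column shown in the table.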
This chart plots each AI model by its benchmark score (vertical axis) against its API output price per million tokens (horizontal axis). Models in the upper-left quadrant offer the best value — high performance at low cost. The efficiency frontier line connects the best-value models at each price point.
The efficiency frontier (Pareto frontier) connects models where no other model offers both a higher score and a lower price. Models on this line represent the optimal price-performance tradeoff. If a model is below and to the right of the frontier, there exists a cheaper model with a better score.
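The frontier definition above translates directly into a short algorithm: sort models by price, then keep each model only if it scores higher than every cheaper model. A minimal sketch (assumed implementation, not the site's code):

```python
def efficiency_frontier(models):
    """models: list of (name, score, price). Returns frontier models, cheapest first.

    A model is on the frontier if no other model is at least as cheap
    and scores strictly higher.
    """
    # Sort by price ascending; break price ties by higher score first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier, best_score = [], float("-inf")
    for name, score, price in ordered:
        # Keep a model only if it beats the best score seen at any lower price.
        if score > best_score:
            frontier.append((name, score, price))
            best_score = score
    return frontier
```

Run against the table's data, this keeps models like Gemini 3.1 Flash-Lite, DeepSeek Coder 2.0, and Kimi K2.5, and drops dominated ones such as GPT-4.1 mini (57 points at $1.60, beaten by DeepSeek Coder 2.0 at 62 points for $1.10).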
Currently, Gemini 3.1 Flash-Lite by Google offers the best overall value with a Score/$ ratio of 140.0. This means you get 140.0 benchmark points per dollar of output token cost.
Overall scores are a normalized weighted average across 8 benchmark categories: agentic (22%), coding (20%), reasoning (17%), knowledge (12%), multimodal (12%), multilingual (7%), instruction following (5%), and math (5%). Category scores use weighted averages of individual benchmarks within each category. Only models with verified benchmark data are included.
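The weighting scheme above can be sketched as a simple weighted average. The category weights come from the methodology text; the function itself is an assumed illustration, not the leaderboard's actual pipeline.

```python
# Category weights as stated in the methodology (they sum to 100%).
WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "knowledge": 0.12,
    "multimodal": 0.12, "multilingual": 0.07,
    "instruction_following": 0.05, "math": 0.05,
}

def overall_score(category_scores):
    """category_scores: dict mapping category name -> normalized score (0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # sanity check: weights sum to 100%
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)
```

As a sanity check, a model scoring 50 in every category gets an overall score of 50, since the weights sum to 1.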