Reasoning models tend to be the most expensive tier — they use chain-of-thought, produce more output tokens, and are priced accordingly. This ranking divides each model's weighted reasoning score by output token price, revealing which models deliver the best abstract reasoning, long-context comprehension, and multi-step logic per dollar. For applications that need strong reasoning without frontier-model budgets, the value leaders here are worth serious consideration.
Unless noted otherwise, the rankings on this page use BenchLM's provisional leaderboard rather than the stricter, sourced-only verified leaderboard.
Bottom line: Reasoning models are expensive — chain-of-thought generates more output tokens. GPT-4.1 nano and Gemini 3.1 Flash-Lite offer the best reasoning per dollar.
According to BenchLM.ai, GPT-4.1 nano leads this ranking with a value score of 166.72, followed by Gemini 3.1 Flash-Lite (149.7) and Gemini 2.5 Flash (74.78). The top two sit well ahead of the field: third place scores less than half of the leader's value.
The best open-weight option is DeepSeek Coder 2.0, ranked #4 with a value score of 54.71. Proprietary models hold the top three spots, and the open-weight leader trails third place (74.78) by about 20 points of value, a gap some teams may accept in exchange for full model control.
This ranking is based on provisional weighted averages across the reasoning benchmarks tracked by BenchLM.ai.
GPT-4.1 nano
OpenAI · 1M
Score: 66.7 · $0.40/1M output
Best reasoning value. Strong reasoning scores at $0.40/1M output.
Gemini 3.1 Flash-Lite
Google · 1M
Score: 59.9 · $0.40/1M output
Close second on reasoning value, at the same low output price as the leader.
Gemini 2.5 Flash
Google · 1M
Score: 44.9 · $0.60/1M output
Good reasoning value with solid all-around performance.
GPT-4.1 nano leads on reasoning value, with the best reasoning capability per dollar.
Gemini 3.1 Flash-Lite is a close second on reasoning value at the same $0.40/1M output price.
Gemini 2.5 Flash offers good reasoning value with broader capabilities.
The best value model is GPT-4.1 nano by OpenAI, with a provisional Score/$ ratio of 166.72 (score: 66.7, output price: $0.40 per 1M tokens).
The best open-weight model is DeepSeek Coder 2.0 at position #4.
30 models are included in this ranking.
Value scores divide the weighted reasoning score by output token price (per 1M tokens). Higher means more capability per dollar. Models with no listed price are excluded.
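As a rough illustration of that formula, the sketch below recomputes value scores from the rounded figures shown on this page. The function name `value_score` and the handling of zero prices are assumptions made for the example, not BenchLM's implementation. The results land within about 0.1 of the published ratios (166.72, 149.7, 74.78); the small differences are presumably because the published ratios use unrounded scores.

```python
from typing import Optional


def value_score(reasoning_score: float,
                output_price_per_1m: Optional[float]) -> Optional[float]:
    """Weighted reasoning score divided by output price (USD per 1M tokens).

    Returns None for models with no listed price, which the ranking excludes.
    Treating a zero or negative price as "no price" is an assumption here.
    """
    if output_price_per_1m is None or output_price_per_1m <= 0:
        return None
    return reasoning_score / output_price_per_1m


# Rounded scores and prices as displayed in the model cards on this page.
models = [
    ("GPT-4.1 nano", 66.7, 0.40),
    ("Gemini 3.1 Flash-Lite", 59.9, 0.40),
    ("Gemini 2.5 Flash", 44.9, 0.60),
]

# Compute value scores, skip unpriced models, and sort highest value first.
scored = [(name, value_score(score, price)) for name, score, price in models]
ranked = sorted(
    [(name, v) for name, v in scored if v is not None],
    key=lambda item: item[1],
    reverse=True,
)

for name, v in ranked:
    print(f"{name}: {v:.2f}")
```

Because the price is the denominator, a small change in price moves the value score much more than a small change in the reasoning score does.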
Value rankings favor cheap models even if absolute performance is modest. A model scoring half as well at one-tenth the price wins on value — but may not meet your quality bar. Always check raw scores alongside value rankings.
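One way to act on that advice is to apply a minimum raw score before ranking by value. The sketch below uses two hypothetical models (`model-a` and `model-b`, not taken from the ranking) and a hypothetical helper `rank_by_value`; the 60-point threshold is likewise only an illustrative quality bar.

```python
def rank_by_value(models, min_score=0.0):
    """Rank (name, score, price_per_1m) tuples by score / price.

    Models whose raw score falls below min_score are dropped first,
    so a cheap but weak model cannot win on price alone.
    """
    eligible = [m for m in models if m[1] >= min_score]
    return sorted(eligible, key=lambda m: m[1] / m[2], reverse=True)


# Hypothetical figures chosen to mirror the example in the text.
candidates = [
    ("model-a", 80.0, 10.0),  # value = 80 / 10 = 8
    ("model-b", 40.0, 1.0),   # half the score, one-tenth the price: value = 40
]

# Without a quality bar, the cheaper model ranks first on value.
print([m[0] for m in rank_by_value(candidates)])

# With a minimum raw score of 60, only the stronger model remains.
print([m[0] for m in rank_by_value(candidates, min_score=60.0)])
```

Here `model-b` has five times the value score of `model-a`, yet it disappears as soon as the raw-score requirement is enforced.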