Reasoning models tend to be the most expensive tier — they use chain-of-thought, produce more output tokens, and are priced accordingly. This ranking divides each model's weighted reasoning score by output token price, revealing which models deliver the best abstract reasoning, long-context comprehension, and multi-step logic per dollar. For applications that need strong reasoning without frontier-model budgets, the value leaders here are worth serious consideration.
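The value metric described above can be sketched as a simple ratio. The model names, scores, and prices below are hypothetical placeholders, not BenchLM.ai's actual data:

```python
# Hypothetical entries: (name, weighted reasoning score, output price in USD per 1M tokens).
models = [
    ("model-a", 62.5, 0.40),
    ("model-b", 71.2, 2.00),
    ("model-c", 55.0, 0.25),
]

# Value score = weighted reasoning score / output token price,
# so cheap models with decent scores can outrank pricier frontier models.
ranked = sorted(
    ((name, score / price) for name, score, price in models),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```

Note how the cheapest model wins here despite the lowest raw score, which mirrors why small, inexpensive models dominate the top of this ranking.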
According to BenchLM.ai, GPT-4.1 nano leads this ranking with a score of 185.23, followed by Gemini 3.1 Flash-Lite (168.5) and Gemini 2.5 Flash (103.56). There is a significant gap between the leading models and the rest of the field.
The best open-weight option is DeepSeek Coder 2.0 (ranked #5 with a score of 66.48). While proprietary models lead, open-weight options are within striking distance for teams willing to trade a few points of performance for full model control.
This ranking is based on a weighted average across the reasoning benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
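The weighted average behind each model's reasoning score can be sketched as follows. The benchmark names and weights are illustrative assumptions, not BenchLM.ai's published weighting:

```python
# Hypothetical per-benchmark scores and weights for a single model.
scores = {"abstract": 78.0, "long_context": 64.0, "multi_step": 70.0}
weights = {"abstract": 0.4, "long_context": 0.3, "multi_step": 0.3}

# Weighted average: sum of score * weight, normalized by total weight.
weighted = sum(scores[b] * weights[b] for b in scores) / sum(weights.values())
print(f"{weighted:.2f}")
```

Dividing a weighted score like this by the model's output token price yields the value figures shown in the ranking above.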