Math is the most saturated benchmark category — top models all score 95%+ on competition math. That makes price the main differentiator for math-heavy workloads. This ranking divides each model's weighted math score by output token price. If you need strong math reasoning (AIME, BRUMO, MATH-500) and the top 10 models all deliver similar accuracy, the value ranking here helps you pick the most cost-effective option.
Unless noted otherwise, the rankings on this page use BenchLM's provisional leaderboard lane rather than the stricter, sourced-only verified leaderboard.
Bottom line: Math is the most saturated category — top models all score 95%+. That makes price the main differentiator. Gemini 3.1 Flash-Lite and DeepSeek Coder 2.0 lead on value.
According to BenchLM.ai, Gemini 3.1 Flash-Lite leads this ranking with a score of 108.22, followed by DeepSeek Coder 2.0 (64.45) and Gemini 2.5 Flash (47.66). There is a significant gap between the leading models and the rest of the field.
The best open-weight option is DeepSeek Coder 2.0 (ranked #2 with a score of 64.45). Open-weight models are highly competitive in this category — self-hosting is a viable alternative to proprietary APIs.
This ranking is based on provisional weighted averages across the scoring math benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
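The weighted average itself is straightforward; here is a minimal sketch using the benchmarks named above with placeholder weights and per-benchmark scores (BenchLM's actual benchmark set, weights, and scores are not published on this page, so these numbers are invented for illustration).

```python
# Hypothetical weights and per-benchmark scores, only to illustrate how a
# weighted math category score is formed; BenchLM's actual values may differ.
weights = {"AIME": 0.4, "BRUMO": 0.3, "MATH-500": 0.3}
scores  = {"AIME": 45.0, "BRUMO": 40.0, "MATH-500": 44.0}  # placeholder scores

weighted_math = sum(weights[b] * scores[b] for b in weights) / sum(weights.values())
print(round(weighted_math, 1))  # -> 43.2, the kind of number shown in the ranking below
```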
1. Gemini 3.1 Flash-Lite · Google · 1M context · Score: 43.3 · $0.4/1M output tokens. Best math value; lowest cost for math workloads.
2. DeepSeek Coder 2.0 · DeepSeek · 128K context · Score: 70.9 · $1.1/1M output tokens. Best raw math per dollar among serious models.
3. Gemini 2.5 Flash · Google · 1M context · Score: 28.6 · $0.6/1M output tokens. Good math value with solid all-around performance.
Gemini 3.1 Flash-Lite leads math value — decent math scores at the lowest price.
DeepSeek Coder 2.0 offers strong raw math (score 70.9) at very low cost.
Gemini 2.5 Flash delivers good math value with broader capabilities.
The best value model is Gemini 3.1 Flash-Lite by Google with a provisional Score/$ ratio of 108.22 (score: 43.3, output: $0.4/1M tokens).
The best open-weight model is DeepSeek Coder 2.0 at position #2.
23 models are included in this ranking.
Value scores divide the weighted math score by output token price (per 1M tokens). Higher means more capability per dollar. Models with no listed price are excluded.
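As a minimal sketch, the Score/$ ratio can be recomputed from the rounded scores and prices listed above; because the published ratios come from unrounded inputs, the results differ very slightly.

```python
# Recomputing the Score/$ value metric from the rounded figures shown on this page.
models = {
    "Gemini 3.1 Flash-Lite": {"score": 43.3, "out_price_per_1m": 0.4},
    "DeepSeek Coder 2.0":    {"score": 70.9, "out_price_per_1m": 1.1},
    "Gemini 2.5 Flash":      {"score": 28.6, "out_price_per_1m": 0.6},
}

for name, m in models.items():
    value = m["score"] / m["out_price_per_1m"]  # higher = more capability per dollar
    print(f"{name}: {value:.2f}")

# Prints roughly 108.25, 64.45, and 47.67, matching the page's 108.22 / 64.45 / 47.66
# up to rounding of the inputs.
```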
Value rankings favor cheap models even if absolute performance is modest. A model scoring half as well at one-tenth the price wins on value — but may not meet your quality bar. Always check raw scores alongside value rankings.
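To make that caveat concrete, here is a toy comparison with invented numbers: halving the score while cutting the price to a tenth multiplies the ratio by five.

```python
# Invented numbers, purely to illustrate how the value ratio rewards cheapness.
strong = {"score": 80.0, "out_price_per_1m": 2.50}   # Score/$ = 32
cheap  = {"score": 40.0, "out_price_per_1m": 0.25}   # Score/$ = 160

for name, m in (("strong", strong), ("cheap", cheap)):
    print(name, m["score"] / m["out_price_per_1m"])

# The cheap model scores half as well yet wins 5x on value, so check the raw
# score against your quality bar before acting on a value ranking.
```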