Math is the most saturated benchmark category — top models all score 95%+ on competition math. That makes price the main differentiator for math-heavy workloads. This ranking divides each model's weighted math score by output token price. If you need strong math reasoning (AIME, BRUMO, MATH-500) and the top 10 models all deliver similar accuracy, the value ranking here helps you pick the most cost-effective option.
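To make the methodology concrete, here is a minimal sketch of the calculation, assuming a simple weighting scheme; the benchmark weights, scores, and price below are illustrative placeholders, not BenchLM.ai's published figures.

```python
# Illustrative sketch of a score-per-price "value" ranking.
# Weights, scores, and prices are hypothetical placeholders,
# not BenchLM.ai's actual data.

# Hypothetical weights over the tracked math benchmarks.
BENCHMARK_WEIGHTS = {"AIME": 0.4, "BRUMO": 0.3, "MATH-500": 0.3}

def weighted_math_score(scores: dict[str, float]) -> float:
    """Weighted average of a model's benchmark scores (0-100 scale)."""
    return sum(BENCHMARK_WEIGHTS[name] * score for name, score in scores.items())

def value_score(scores: dict[str, float], output_price_per_mtok: float) -> float:
    """Weighted math score divided by output token price (USD per 1M tokens)."""
    return weighted_math_score(scores) / output_price_per_mtok

# Example: a hypothetical model scoring in the mid-90s at $0.60 per 1M output tokens.
example = value_score({"AIME": 94.0, "BRUMO": 95.0, "MATH-500": 97.0}, 0.60)
print(f"value score: {example:.2f}")  # ≈ 158.67
```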
According to BenchLM.ai, Gemini 3.1 Flash-Lite leads this ranking with a score of 162.62, followed by Gemini 2.5 Flash (88.77) and DeepSeek Coder 2.0 (73.68). The leader sits well clear of the field: its value score is nearly twice that of the runner-up, and the drop-off continues beyond the top models.
The best open-weight option is DeepSeek Coder 2.0 (ranked #3 with a score of 73.68). Open-weight models are highly competitive in this category — self-hosting is a viable alternative to proprietary APIs.
This ranking is based on a weighted average of each model's scores across the math benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.