Best Value LLM for Math in 2026 — Cost-Adjusted Rankings

Math is the most saturated benchmark category: the top models all score 95%+ on competition math, which makes price the main differentiator for math-heavy workloads. This ranking divides each model's weighted math score by its output token price (USD per 1M tokens). If you need strong math reasoning (AIME, BRUMO, MATH-500) and the top 10 models all deliver similar accuracy, the value ranking here helps you pick the most cost-effective option.
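Because the ranking is driven by output token price, it can help to translate a price into a bill for your own workload. The following is a minimal Python sketch, assuming output tokens dominate cost and ignoring input-token charges. The function name and the 50M-token workload are illustrative assumptions; the two prices are the ones listed in the rankings on this page.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Estimated spend on output tokens only (input tokens are ignored)."""
    return output_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload: 50 million output tokens per month.
monthly_tokens = 50_000_000

# Prices per 1M output tokens as listed on this page.
cheap = output_cost_usd(monthly_tokens, 0.5)    # Grok 4.1 Fast
pricey = output_cost_usd(monthly_tokens, 14.0)  # GPT-5.3 Codex

print(f"${cheap:.2f} vs ${pricey:.2f} per month")
```

At similar accuracy, a gap of this size in monthly spend is what the value ranking is meant to surface.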

Unless noted otherwise, ranking surfaces on this page use BenchLM's provisional leaderboard lane rather than the stricter sourced-only verified leaderboard.

Bottom line: with the top models this close on accuracy, price is the main differentiator. Gemini 3.1 Flash-Lite and DeepSeek Coder 2.0 lead on value.

According to BenchLM.ai, Grok 4.1 Fast leads this ranking with a Score/$ of 187.43, followed by DeepSeek V3.2 (167.76) and Mistral Large 3 (42.82). The top two are far ahead of the rest of the field: third place scores roughly a quarter of second place.

The best open-weight option is DeepSeek V3.2, ranked #2 with a Score/$ of 167.76. Open-weight models are highly competitive in this category, so self-hosting is a viable alternative to proprietary APIs.

This ranking is based on provisional weighted averages across the math benchmarks tracked by BenchLM.ai.
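The exact benchmarks and weights behind BenchLM's weighted math score are not given here. As an illustration only, a weighted average can be sketched in Python as follows. The per-benchmark scores and weights are invented for the example and are not BenchLM's, and the function name is hypothetical.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks that have both a score and a weight."""
    common = [name for name in scores if name in weights]
    if not common:
        raise ValueError("no overlapping benchmarks")
    total_weight = sum(weights[name] for name in common)
    return sum(scores[name] * weights[name] for name in common) / total_weight


# Hypothetical numbers, not taken from BenchLM.
scores = {"AIME": 90.0, "BRUMO": 80.0, "MATH-500": 100.0}
weights = {"AIME": 2.0, "BRUMO": 1.0, "MATH-500": 1.0}

print(weighted_score(scores, weights))  # 90.0
```

Normalizing by the total weight of the benchmarks actually present keeps the result on the same 0 to 100 scale when a model is missing one benchmark.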

1. Grok 4.1 Fast · xAI · Closed · 1M · 187.43 Score/$ · Score: 93.7 · $0.5/1M

2. DeepSeek V3.2 · DeepSeek · Open · 128K · 167.76 Score/$ · Score: 70.5 · $0.42/1M

3. Mistral Large 3 · Mistral · Closed · 128K · 42.82 Score/$ · Score: 64.2 · $1.5/1M

What changed

Gemini 3.1 Flash-Lite leads math value: decent math scores at the lowest price.

DeepSeek Coder 2.0: strong raw math (71) at very low cost.

Gemini 2.5 Flash: good math value with broader capabilities.

How to choose

Math on the tightest budget? Gemini 3.1 Flash-Lite: lowest cost.

Strong math at good value? DeepSeek Coder 2.0: raw math (71) at low cost.

Best raw math regardless of cost? See the standard math leaderboard.

Full Rankings (45 models)

| Rank | Model | Provider | License | Context | Score/$ | Score | Output price (per 1M tokens) |
|---|---|---|---|---|---|---|---|
| 1 | Grok 4.1 Fast | xAI | Proprietary | 1M | 187.43 | 93.7 | $0.5 |
| 2 | DeepSeek V3.2 | DeepSeek | Open Weight | 128K | 167.76 | 70.5 | $0.42 |
| 3 | Mistral Large 3 | Mistral | Proprietary | 128K | 42.82 | 64.2 | $1.5 |
| 4 | Grok Code Fast 1 | xAI | Proprietary | 256K | 30.42 | 45.6 | $1.5 |
| 5 | GLM-5 (Reasoning) | Z.AI | Open Weight | 200K | 28.86 | 92.3 | $3.2 |
| 6 | Gemini 3.1 Flash-Lite | Google | Proprietary | 1M | 28.86 | 43.3 | $1.5 |
| 7 | GLM-5 | Z.AI | Open Weight | 200K | 28.53 | 91.3 | $3.2 |
| 8 | Claude 3 Haiku | Anthropic | Proprietary | 200K | 27.79 | 34.7 | $1.25 |
| 9 | Qwen3.5 397B (Reasoning) | Alibaba | Open Weight | 128K | 25.64 | 92.3 | $3.6 |
| 10 | DeepSeek V3.2 (Thinking) | DeepSeek | Open Weight | 128K | 23.75 | 52 | $2.19 |
| 11 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 128K | 22.39 | 67.2 | $3 |
| 12 | Kimi K2 | Moonshot AI | Proprietary | 128K | 20.53 | 51.3 | $2.5 |
| 13 | GLM-5.1 | Z.AI | Open Weight | 203K | 20.43 | 89.9 | $4.4 |
| 14 | Qwen3.5 397B | Alibaba | Open Weight | 128K | 20.37 | 73.3 | $3.6 |
| 15 | Kimi K2.5 | Moonshot AI | Open Weight | 256K | 18.93 | 56.8 | $3 |
| 16 | Gemini 3 Flash | Google | Proprietary | 1M | 17.44 | 52.3 | $3 |
| 17 | DeepSeek-R1 | DeepSeek | Open Weight | 128K | 15.54 | 34 | $2.19 |
| 18 | Gemini 2.5 Flash | Google | Proprietary | 1M | 11.18 | 27.9 | $2.5 |
| 19 | Claude Haiku 4.5 | Anthropic | Proprietary | 200K |  | 55 | $5 |
| 20 |  | OpenAI | Proprietary | 200K | 10.42 | 83.4 | $8 |
| 21 | GPT-5.1-Codex-Max | OpenAI | Proprietary | 400K | 9.72 | 97.2 | $10 |
| 22 | Gemini 1.5 Pro | Google | Proprietary | 2M | 9.13 | 45.6 | $5 |
| 23 | Gemini 2.5 Pro | Google | Proprietary | 1M | 7.36 | 73.6 | $10 |
| 24 | GPT-5.3 Codex | OpenAI | Proprietary | 400K | 7.14 | 100 | $14 |
| 25 | GPT-5 (high) | OpenAI | Proprietary | 128K | 7.04 | 70.4 | $10 |
| 26 | GPT-5.2-Codex | OpenAI | Proprietary | 400K | 6.98 | 97.7 | $14 |
| 27 | GPT-5.1 | OpenAI | Proprietary | 200K | 6.83 | 68.3 | $10 |
| 28 | Gemini 3 Pro | Google | Proprietary | 2M | 6.73 | 80.8 | $12 |
| 29 | GPT-5.4 | OpenAI | Proprietary | 1.05M | 6.29 | 94.4 | $15 |
| 30 | Claude Sonnet 4.5 | Anthropic | Proprietary | 200K | 5.84 | 87.7 | $15 |
| 31 | GPT-5.2 | OpenAI | Proprietary | 400K | 5.79 | 81.1 | $14 |
| 32 | Gemini 3.1 Pro | Google | Proprietary | 1M | 5.61 | 67.3 | $12 |
| 33 | GPT-4o | OpenAI | Proprietary | 128K | 5.2 | 52 | $10 |
| 34 | Claude Sonnet 4.6 | Anthropic | Proprietary | 200K | 5.06 | 75.9 | $15 |
| 35 | Claude 4 Sonnet | Anthropic | Proprietary | 200K | 4.08 | 61.1 | $15 |
| 36 | Claude Opus 4.5 | Anthropic | Proprietary | 200K | 3.8 | 94.9 | $25 |
| 37 | GLM-4.5 | Z.AI | Proprietary | 128K | 3.59 | 7.9 | $2.2 |
| 38 | Claude Opus 4.6 | Anthropic | Proprietary | 1M | 3.45 | 86.3 | $25 |
| 39 | GLM-4.5-Air | Z.AI | Proprietary | 128K | 3.41 | 3.7 | $1.1 |
| 40 | Claude 3.5 Sonnet | Anthropic | Proprietary | 200K | 3.39 | 50.8 | $15 |
| 41 | o1-preview | OpenAI | Proprietary | 200K | 1.57 | 94.1 | $60 |
| 42 | GPT-4 Turbo | OpenAI | Proprietary | 128K | 1.32 | 39.6 | $30 |
| 43 | o3-pro | OpenAI | Proprietary | 200K | 1.08 | 86.4 | $80 |
| 44 | Claude 4.1 Opus | Anthropic | Proprietary | 200K | 0.86 | 64.8 | $75 |
| 45 | Claude 3 Opus | Anthropic | Proprietary | 200K | 0.56 | 42 | $75 |

These rankings update weekly


Key Takeaways

The best value model is Grok 4.1 Fast by xAI with a provisional Score/$ ratio of 187.43 (score: 93.7, output: $0.5/1M tokens).

The best open-weight model is DeepSeek V3.2 at position #2.

45 models are included in this ranking.

Score in Context

What these scores mean

Value scores divide the weighted math score by output token price (per 1M tokens). Higher means more capability per dollar. Models with no listed price are excluded.
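As a rough illustration of that definition, here is a minimal Python sketch. The function name is an assumption, and treating a missing or non-positive price as "excluded" is one possible reading of the rule above. Because the scores shown on this page are rounded, recomputed ratios can differ from the listed Score/$ in the second decimal (for example 42.8 versus the listed 42.82 for Mistral Large 3).

```python
from typing import Optional


def value_score(math_score: float, price_per_million_usd: Optional[float]) -> Optional[float]:
    """Score per dollar; None when the model has no listed (or a non-positive) price."""
    if price_per_million_usd is None or price_per_million_usd <= 0:
        return None
    return math_score / price_per_million_usd


print(round(value_score(64.2, 1.5), 2))  # Mistral Large 3 -> 42.8
print(round(value_score(94.4, 15), 2))   # GPT-5.4 -> 6.29
print(value_score(80.0, None))           # hypothetical unpriced model -> None
```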

Known limitations

Value rankings favor cheap models even if absolute performance is modest. A model scoring half as well at one-tenth the price has five times the Score/$ and wins on value, but it may not meet your quality bar. Always check raw scores alongside value rankings.
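One way to act on this is to apply a minimum raw score first and only then rank by value. The Python sketch below assumes a quality bar of 90, which is an arbitrary choice for illustration; the helper name is hypothetical, and the five rows are copied from the rankings on this page.

```python
# (name, math score, price per 1M output tokens) copied from the rankings on this page.
MODELS = [
    ("Grok 4.1 Fast", 93.7, 0.5),
    ("DeepSeek V3.2", 70.5, 0.42),
    ("Mistral Large 3", 64.2, 1.5),
    ("GLM-5 (Reasoning)", 92.3, 3.2),
    ("GPT-5.3 Codex", 100.0, 14.0),
]


def best_value(models, min_score):
    """Rank by score per dollar, keeping only models at or above min_score."""
    eligible = [m for m in models if m[1] >= min_score]
    return sorted(eligible, key=lambda m: m[1] / m[2], reverse=True)


for name, score, price in best_value(MODELS, min_score=90.0):
    print(f"{name}: {score / price:.2f} per dollar")
```

With this bar, DeepSeek V3.2 and Mistral Large 3 drop out despite their high Score/$, which is the trade-off described above.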


Last updated: May 20, 2026
