MATH-500 Problem Set

A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

How BenchLM shows MATH-500 right now

BenchLM is tracking MATH-500 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

118 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MATH-500 — April 21, 2026

BenchLM mirrors the published tracked score view for MATH-500. GPT-5.3 Codex leads the public snapshot at 99%, followed by GPT-5.4 (99%) and GPT-5.2 Pro (99%). BenchLM does not use these results to rank models overall.

118 models · Math · 15% of category score · Stale · Updated April 21, 2026

The published MATH-500 snapshot is tightly clustered at the top: GPT-5.3 Codex leads at 99%, and the third-ranked model ties at the same 99%. The top-10 spread is just 1.2 points, so the leading published scores sit in a narrow band.
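
The "top-10 spread" above can be reproduced directly from the scores in the tracked table below. A minimal sketch (the score list is copied from the snapshot table on this page):

```python
# Top-10 tracked MATH-500 scores from the snapshot table (percent).
top10 = [99.0, 99.0, 99.0, 98.6, 98.0, 98.0, 98.0, 98.0, 98.0, 97.8]

# Spread = best score minus the 10th-ranked score.
spread = max(top10) - min(top10)
print(round(spread, 1))  # 1.2
```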

118 models have been evaluated on MATH-500. The benchmark falls in the Math category. This category carries a 5% weight in BenchLM.ai's overall scoring system. Within that category, MATH-500 contributes 15% of the category score, so strong performance here directly affects a model's overall ranking.
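
Assuming the two stated weights combine multiplicatively (the page does not spell out the formula, so this is an illustration, not BenchLM's documented method), MATH-500's effective share of the overall score works out to 0.75%:

```python
# Stated BenchLM weights (assumption: simple multiplicative weighting).
category_weight = 0.05   # Math category's share of the overall score
benchmark_share = 0.15   # MATH-500's share within the Math category

effective_weight = category_weight * benchmark_share
print(f"{effective_weight:.2%}")  # 0.75%
```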

About MATH-500

Year

2021

Tasks

500 problems

Format

Free-form mathematical answers

Difficulty

High school to undergraduate

MATH-500 is one of the most widely cited math benchmarks. It is nearing saturation, with top reasoning models scoring 96-99%, which makes it less useful for differentiating frontier models but still a standard baseline.

BenchLM freshness & provenance

Version

MATH-500 2021

Refresh cadence

Static

Staleness state

Stale

Question availability

Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Tracked score table (118 models)

1. GPT-5.3 Codex (gpt-5-3-codex): 99%
2. GPT-5.4 (gpt-5-4): 99%
3. GPT-5.2 Pro (gpt-5-2-pro): 99%
4. Sarvam 105B (sarvam-105b): 98.6%
5. Claude Opus 4.6 (claude-opus-4-6): 98%
6. GPT-5.2 (gpt-5-2): 98%
7. GPT-5.3 Instant (gpt-5-3-instant): 98%
8. GPT-5.2 Instant (gpt-5-2-instant): 98%
9. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 98%
10. Claude Sonnet 4.6 (claude-sonnet-4-6): 97.8%
11. GLM-5.1 (glm-5-1): 97.4%
12. GLM-5 (glm-5): 97.4%
13. GPT-5.4 mini (gpt-5-4-mini): 97.4%
14. Kimi K2 (kimi-k2): 97.4%
15. DeepSeek-R1 (deepseek-r1): 97.3%
16. Grok 4.1 (grok-4-1): 97%
17. Gemini 3.1 Pro (gemini-3-1-pro): 97%
18. Sarvam 30B (sarvam-30b): 97%
19. MiniMax M1 80k (minimax-m1-80k): 96.8%
20. Phi-4 (phi-4): 94.6%
21. GPT-5.2-Codex (gpt-5-2-codex): 94%
22. GPT-5 (high) (gpt-5-high): 94%
23. 94%
24. GPT-5.1 (gpt-5-1): 94%
25. Mistral Large 3 (mistral-large-3): 93.6%
26. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 93%
27. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 93%
28. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 92%
29. GPT-5 (medium) (gpt-5-medium): 92%
30. GLM-5 (Reasoning) (glm-5-reasoning): 92%
31. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 92%
32. Gemini 3 Pro (gemini-3-pro): 91%
33. Mistral Medium 3 (mistral-medium-3): 91%
34. DeepSeek V3 (deepseek-v3): 90.2%
35. MiMo-V2-Flash (mimo-v2-flash): 90%
36. DeepSeekMath V2 (deepseekmath-v2): 90%
37. Grok 4.1 Fast (grok-4-1-fast): 89%
38. Claude Opus 4.5 (claude-opus-4-5): 89%
39. 89%
40. Claude Sonnet 4.5 (claude-sonnet-4-5): 88%
41. 88%
42. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 87%
43. GLM-4.7 (glm-4-7): 85%
44. Step 3.5 Flash (step-3-5-flash): 85%
45. GPT-5 mini (gpt-5-mini): 85%
46. GLM-4.7-Flash (glm-4-7-flash): 85%
47. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 85%
48. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 84%
49. Gemini 2.5 Pro (gemini-2-5-pro): 84%
50. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 84%
51. Qwen2.5-72B (qwen2-5-72b): 84%
52. o4-mini (high) (o4-mini-high): 84%
53. Qwen2.5-1M (qwen2-5-1m): 83%
54. Grok 4 (grok-4): 83%
55. DeepSeek LLM 2.0 (deepseek-llm-2-0): 83%
56. Nemotron 3 Super 100B (nemotron-3-super-100b): 83%
57. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 82.7%
58. Kimi K2.5 (kimi-k2-5): 82%
59. Llama 3.1 405B (llama-3-1-405b): 82%
60. Mistral Large 2 (mistral-large-2): 82%
61. Seed 1.6 (seed-1-6): 82%
62. Mercury 2 (mercury-2): 82%
63. Qwen3.5 397B (qwen3-5-397b): 81%
64. DeepSeek Coder 2.0 (deepseek-coder-2-0): 81%
65. Claude 4 Sonnet (claude-4-sonnet): 81%
66. Claude 4.1 Opus (claude-4-1-opus): 81%
67. Claude Haiku 4.5 (claude-haiku-4-5): 81%
68. DeepSeek V3.2 (deepseek-v3-2): 81%
69. Seed-2.0-Lite (seed-2-0-lite): 81%
70. MiniMax M2.5 (minimax-m2-5): 81%
71. Gemini 3 Flash (gemini-3-flash): 80%
72. Claude 3.5 Sonnet (claude-3-5-sonnet): 80%
73. GPT-4o (gpt-4o): 80%
74. Nemotron Ultra 253B (nemotron-ultra-253b): 74%
75. Grok Code Fast 1 (grok-code-fast-1): 73%
76. Gemini 1.5 Pro (gemini-1-5-pro): 73%
77. Claude 3 Opus (claude-3-opus): 73%
78. Mistral 8x7B (mistral-8x7b): 73%
79. Z-1 (z-1): 73%
80. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 73%
81. Gemini 2.5 Flash (gemini-2-5-flash): 72%
82. Moonshot v1 (moonshot-v1): 72%
83. Gemini 1.0 Pro (gemini-1-0-pro): 72%
84. Seed 1.6 Flash (seed-1-6-flash): 72%
85. Ministral 3 14B (ministral-3-14b): 72%
86. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 71%
87. GPT-OSS 120B (gpt-oss-120b): 71%
88. Claude 3 Haiku (claude-3-haiku): 71%
89. GPT-4 Turbo (gpt-4-turbo): 71%
90. Nemotron-4 15B (nemotron-4-15b): 71%
91. Llama 3 70B (llama-3-70b): 71%
92. Aion-2.0 (aion-2-0): 71%
93. Seed-2.0-Mini (seed-2-0-mini): 70%
94. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 67%
95. 66%
96. 65.8%
97. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 62%
98. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 61%
99. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 60%
100. Llama 4 Behemoth (llama-4-behemoth): 60%
101. Mistral 7B v0.3 (mistral-7b-v0-3): 60%
102. Ministral 3 8B (ministral-3-8b): 60%
103. Grok 3 [Beta] (grok-3-beta): 59%
104. Llama 4 Maverick (llama-4-maverick): 59%
105. Nova Pro (nova-pro): 59%
106. DeepSeek V3.1 (deepseek-v3-1): 59%
107. GPT-OSS 20B (gpt-oss-20b): 59%
108. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 59%
109. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 59%
110. Qwen3 235B 2507 (qwen3-235b-2507): 57%
111. Llama 4 Scout (llama-4-scout): 57%
112. GLM-4.5-Air (glm-4-5-air): 57%
113. GLM-4.5 (glm-4-5): 57%
114. LFM2-24B-A2B (lfm2-24b-a2b): 57%
115. Gemma 3 27B (gemma-3-27b): 56%
116. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 54%
117. Ministral 3 3B (ministral-3-3b): 53%
118. 34.4%

FAQ

What does MATH-500 measure?

MATH-500 measures performance on a curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

Which model leads the published MATH-500 snapshot?

GPT-5.3 Codex currently leads the published MATH-500 snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MATH-500?

118 AI models are included in BenchLM's mirrored MATH-500 snapshot, based on the public leaderboard captured on April 21, 2026.

Last updated: April 21, 2026 · mirrored from the public benchmark leaderboard
