MATH-500 Problem Set (MATH-500)

A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

About MATH-500

Year: 2021
Tasks: 500 problems
Format: Free-form mathematical answers
Difficulty: High school to undergraduate

MATH-500 is one of the most widely cited math benchmarks. It is nearing saturation: top reasoning models score 96-99%, which makes it less useful for differentiating frontier models, though it remains a standard baseline.
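
To make the saturation point concrete, the sketch below converts percentage scores on a 500-problem set into problem counts and estimates the sampling noise of a single score. It assumes each problem is scored simply as right or wrong and treats the 500 problems as independent trials, which is a simplification. The function names are illustrative and not part of any benchmark tooling.

```python
import math

NUM_PROBLEMS = 500  # size of MATH-500, taken from the description above


def problems_missed(score_pct: float, n: int = NUM_PROBLEMS) -> int:
    """Number of problems answered incorrectly at a given percentage score."""
    return round(n * (100.0 - score_pct) / 100.0)


def standard_error_pct(score_pct: float, n: int = NUM_PROBLEMS) -> float:
    """Binomial standard error of a score, in percentage points.

    Assumption: problems are independent right/wrong trials.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Print score, problems missed, and approximate standard error.
    for score in (99, 98, 97, 96):
        print(score, problems_missed(score), round(standard_error_pct(score), 2))
```

On a 500-problem set, one percentage point corresponds to 5 problems, so a model at 99 misses 5 problems and a model at 96 misses 20. Under the independence assumption, the standard error is a bit under half a point at 99 and a bit under one point at 96, so one-point gaps near the top of the leaderboard are close to the noise level.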

Paper: Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021)

Leaderboard (88 models)

#1 GPT-5.3 Codex: 99
#2 GPT-5.4: 99
#3 GPT-5.2: 98
#4 Claude Opus 4.6: 98
#5 Gemini 3.1 Pro: 97
#6 Grok 4.1: 97
#7 GPT-5.2-Codex: 94
#8 GPT-5.1: 94
#9 GPT-5 (high): 94
#10 o1-preview: 94
#14 GLM-5 (Reasoning): 92
#15 GPT-5 (medium): 92
#16 Kimi K2.5 (Reasoning): 92
#17 Claude Sonnet 4.6: 91
#18 Gemini 3 Pro: 91
#19 DeepSeekMath V2: 90
#20 MiMo-V2-Flash: 90
#21 Claude Opus 4.5: 89
#23 o3-pro: 89
#24 Claude Sonnet 4.5: 88
#25 o3: 88
#26 GPT-5 mini: 85
#27 GLM-4.7: 85
#28 GLM-4.7-Flash: 85
#30 Qwen2.5-72B: 84
#31 Gemini 2.5 Pro: 84
#32 o4-mini (high): 84
#34 Grok 4: 83
#35 Qwen2.5-1M: 83
#36 DeepSeek LLM 2.0: 83
#38 GLM-5: 82
#39 Kimi K2.5: 82
#41 Mistral Large 2: 82
#42 DeepSeek Coder 2.0: 81
#43 DeepSeek V3.2: 81
#44 Qwen3.5 397B: 81
#45 Claude 4.1 Opus: 81
#46 Claude 4 Sonnet: 81
#47 MiniMax M2.5: 81
#48 Claude Haiku 4.5: 81
#49 Mistral Large 3: 80
#50 Gemini 3 Flash: 80
#51 Claude 3.5 Sonnet: 80
#52 GPT-4o: 80
#55 Mistral 8x7B: 73
#56 Gemini 1.5 Pro: 73
#57 Claude 3 Opus: 73
#59 Z-1: 73
#60 Gemini 1.0 Pro: 72
#61 Moonshot v1: 72
#62 Gemini 2.5 Flash: 72
#64 GPT-4 Turbo: 71
#65 Nemotron-4 15B: 71
#66 Llama 3 70B: 71
#67 Claude 3 Haiku: 71
#68 GPT-OSS 120B: 71
#69 DeepSeek-R1: 64
#74 Mistral 7B v0.3: 60
#76 Qwen2.5-VL-32B: 59
#78 Nova Pro: 59
#79 DeepSeek V3.1: 59
#80 GPT-OSS 20B: 59
#81 Mistral 8x7B v0.2: 59
#82 Llama 4 Scout: 57
#83 Qwen3 235B 2507: 57
#84 GLM-4.5: 57
#85 MiniMax M1 80k: 57
#86 GLM-4.5-Air: 57
#87 Kimi K2: 57
#88 Gemma 3 27B: 56

FAQ

What does MATH-500 measure?

MATH-500 measures mathematical problem solving on a curated subset of 500 problems from the MATH dataset. The problems cover algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus, and answers are given in free form.

Which model scores highest on MATH-500?

GPT-5.3 Codex by OpenAI is currently ranked #1 on MATH-500 with a score of 99. GPT-5.4 is listed at #2 with the same score.

How many models are evaluated on MATH-500?

88 AI models have been evaluated on MATH-500 on BenchLM.