A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.
BenchLM is tracking MATH-500 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.
These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.
BenchLM mirrors the published tracked score view for MATH-500. GPT-5.3 Codex leads the public snapshot at 99%, followed by GPT-5.4 (99%) and GPT-5.2 Pro (99%). BenchLM does not use these results to rank models overall.
Model | Organization | Model ID
GPT-5.3 Codex | OpenAI | gpt-5-3-codex
GPT-5.4 | OpenAI | gpt-5-4
GPT-5.2 Pro | OpenAI | gpt-5-2-pro
The published MATH-500 snapshot is tightly clustered at the top: the leading three rows are all tied at 99%. The broader top-10 spread is only 1.2 points, so many of the published scores sit in a relatively narrow band.
118 models have been evaluated on MATH-500. The benchmark falls in the Math category. This category carries a 5% weight in BenchLM.ai's overall scoring system. Within that category, MATH-500 contributes 15% of the category score, so strong performance here directly affects a model's overall ranking.
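The two weights above compound: a 5% category weight and a 15% within-category weight imply a small effective share of the overall score. A minimal sketch of that arithmetic, using only the percentages stated above (the variable names are illustrative, not BenchLM's actual scoring code):

```python
# Illustrative arithmetic only; the weights are those stated above.
category_weight = 0.05   # Math category's share of BenchLM's overall score
benchmark_weight = 0.15  # MATH-500's share within the Math category

# Effective share of the overall score driven by MATH-500
effective_weight = category_weight * benchmark_weight
print(f"{effective_weight:.4%}")  # 0.7500%
```

So under these weights, a one-point change on MATH-500 moves a model's overall score by less than a hundredth of a point.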
Year: 2021
Tasks: 500 problems
Format: Free-form mathematical answers
Difficulty: High school to undergraduate
MATH-500 is one of the most widely cited math benchmarks. It is nearing saturation, with top reasoning models scoring 96-99%, which makes it less useful for differentiating frontier models but still a standard baseline.
Version: MATH-500 2021
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
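The three treatments named above can be sketched as a simple mapping from a staleness label to a scoring role. This is an illustrative assumption based on the labels shown on this page ("fresh"/"aging" are hypothetical tier names), not BenchLM's published implementation:

```python
def benchmark_role(staleness_state: str) -> str:
    """Map a staleness label to how the benchmark is treated in scoring.

    The tier names follow the three treatments described above; the
    mapping itself is an illustrative sketch, not BenchLM's actual policy.
    """
    roles = {
        "fresh": "strong differentiator",   # hypothetical label
        "aging": "benchmark to watch",      # hypothetical label
        "stale": "display-only reference",  # label shown on this page
    }
    # Unknown labels fall back to the most conservative treatment.
    return roles.get(staleness_state.lower(), "display-only reference")

print(benchmark_role("Stale"))  # display-only reference
```

Under this sketch, MATH-500's "Stale" state explains why the page above is shown as a display-only reference rather than a ranked leaderboard.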
GPT-5.3 Codex currently leads the published MATH-500 snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
118 AI models are included in BenchLM's mirrored MATH-500 snapshot, based on the public leaderboard captured on April 21, 2026.