Math Benchmarks

Mathematical reasoning and problem solving

AIME 2023 · AIME 2024 · AIME 2025 · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500

Math benchmarks test whether AI models can solve competition-level mathematics problems requiring creative insight and multi-step reasoning; the competitions listed here score final numeric answers rather than written proofs. Mathematics carries a 5% weight in BenchLM.ai's overall scoring system.
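A category weight like this just scales a weighted average. The sketch below illustrates the arithmetic; only the 5% math weight comes from the text above, and the other category names and weights are hypothetical placeholders, not BenchLM.ai's actual formula:

```python
# Hypothetical category weights: only the 5% math weight is stated on this
# page; the remaining categories and values are illustrative assumptions.
WEIGHTS = {"math": 0.05, "coding": 0.25, "knowledge": 0.70}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (weights must sum to 1)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[cat] * category_scores[cat] for cat in WEIGHTS)

# A near-perfect math score moves the overall number very little at 5% weight:
print(round(overall_score({"math": 98.0, "coding": 85.0, "knowledge": 80.0}), 2))
```

Because math contributes only 5 of 100 weight points, even a 10-point math gap between two models shifts their overall scores by just half a point under a scheme like this.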

A key challenge with math benchmarks in 2026 is saturation. Frontier models score 95-99% on AIME and HMMT, meaning competition math at this level is effectively solved. The 1-2 point differences between top models fall within the noise. BRUMO 2025 and MATH-500 still show meaningful separation, particularly among mid-tier and open-weight models.
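Whether a 1-2 point gap is meaningful can be estimated from the benchmark's size. A rough sketch, assuming each problem is an independent pass/fail trial (a simplification that ignores correlated failure modes and repeated sampling):

```python
import math

def score_stderr(accuracy_pct: float, n_problems: int) -> float:
    """Binomial standard error of a benchmark score, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)

# A single AIME year has 30 problems (15 each on AIME I and II), so even at
# 97% accuracy the standard error is about 3 points -- larger than the
# 1-2 point gaps separating the top models.
print(round(score_stderr(97.0, 30), 1))
```

On a larger set like MATH-500, the same accuracy carries a standard error under 1 point, which is one reason it still separates models where 30-problem competitions cannot.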

Reasoning-enhanced models (those using chain-of-thought) consistently outperform standard models on math by 10-20 points. If mathematical reasoning is critical for your use case, prioritize models with explicit reasoning capabilities. See our math rankings for the full leaderboard, or read our AIME & HMMT explainer.

124 models

| # | Model | Organization | License | Type | Context | Score | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | MATH-500 |
|---:|---|---|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 99% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 98% | 98% | 97% | 94% | 96% | 95% | 95% | 98% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 93% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 92% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 95% | 97% | 96% | 91% | 93% | 92% | 94% | 94% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 98% | 99% | 98% | 94% | 96% | 95% | 96% | 92% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 93% | 95% | 94% | 89% | 91% | 90% | 92% | 92% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 89% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 94% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 97% | 99% | 98% | 93% | 95% | 94% | 96% | 88% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 96% | 98% | 97% | 92% | 94% | 93% | 95% | 89% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 92% |

Showing 25 of 124


About Math Benchmarks

High school mathematics competition