Math Benchmarks
Mathematical reasoning and problem solving
AIME 2023 · AIME 2024 · AIME 2025 · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500
Math benchmarks test whether AI models can solve competition-level mathematics problems requiring creative insight and multi-step reasoning. The benchmarks tracked here are scored on final answers rather than written proofs. Mathematics carries a 5% weight in BenchLM.ai's overall scoring system.
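The 5% figure implies the overall score is a weighted average across benchmark categories. A minimal sketch of that aggregation, with the caveat that only the 5% math weight comes from this page; the other category names and weights below are hypothetical placeholders:

```python
def overall_score(category_scores: dict, weights: dict) -> float:
    """Weighted average of per-category scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[c] * category_scores[c] for c in weights)

# Math carries 5%; the remaining 95% split below is invented for illustration.
weights = {"math": 0.05, "coding": 0.45, "knowledge": 0.30, "agentic": 0.20}
scores = {"math": 91, "coding": 85, "knowledge": 88, "agentic": 80}
print(round(overall_score(scores, weights), 2))  # → 85.2
```

One consequence of the low weight: even a 10-point math gap between two models moves their overall scores by only half a point.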
A key challenge with math benchmarks in 2026 is saturation. Frontier models score 95-99% on AIME and HMMT — competition math is effectively solved by AI. The 1-2 point differences between top models are within noise range. BRUMO and MATH-500 still show more meaningful separation, particularly among mid-tier and open-weight models.
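The noise claim can be made concrete with a back-of-the-envelope standard error. A sketch assuming a single scoring pass over the 30 problems AIME poses per year (AIME I and II combined), treating each problem as an independent Bernoulli trial:

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n items: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# At 97% accuracy on a 30-problem set, a single run's score carries
# roughly 3 percentage points of sampling noise, so a 1-2 point gap
# between two models is not statistically meaningful.
print(round(100 * accuracy_se(0.97, 30), 1))  # → 3.1
```

Averaging k independent runs shrinks this standard error by a factor of sqrt(k), which is why labs typically report pass@1 averaged over many samples.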
Reasoning-enhanced models (those using chain-of-thought) consistently outperform standard models on math by 10-20 points. If mathematical reasoning is critical for your use case, prioritize models with explicit reasoning capabilities. See our math rankings for the full leaderboard, or read our AIME & HMMT explainer.
| # | Model | Org | Access | Type | Context | Score | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | MATH-500 |
|---|-------|-----|--------|------|---------|-------|-----------|-----------|-----------|---------------|---------------|---------------|------------|----------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 99% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 98% | 98% | 97% | 94% | 96% | 95% | 95% | 98% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 93% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 92% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 95% | 97% | 96% | 91% | 93% | 92% | 94% | 94% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 98% | 99% | 98% | 94% | 96% | 95% | 96% | 92% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 93% | 95% | 94% | 89% | 91% | 90% | 92% | 92% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 89% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 94% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 97% | 99% | 98% | 93% | 95% | 94% | 96% | 88% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 96% | 98% | 97% | 92% | 94% | 93% | 95% | 89% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 92% |
About Math Benchmarks
High school mathematics competition