Math Benchmarks — AIME, HMMT & MATH-500 Leaderboard
Mathematical reasoning and problem solving
Bottom line: Competition math is largely solved by frontier models; the older AIME and HMMT editions are saturated. AIME 2025, BRUMO 2025, and MATH-500 still show meaningful separation.
AIME 2023 · AIME 2024 · AIME 2025 · AIME25 (Arcee) · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500
Best Math picks
BenchLM summaries for math, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
Top AI Models for Math — April 2026
As of April 2026, GPT-5.3 Codex leads the provisional math leaderboard with a weighted score of 100.0%, followed by GPT-5.2-Codex (97.7%) and GPT-5.1-Codex-Max (97.2%). BenchLM is currently showing 86 provisional-ranked models and 0 verified-ranked models in this category.
1. GPT-5.3 Codex (OpenAI)
2. GPT-5.2-Codex (OpenAI)
3. GPT-5.1-Codex-Max (OpenAI)
What changed
Claude Mythos Preview leads math with top BRUMO and MATH-500 scores.
GPT-5.4 is a close second, with near-perfect AIME scores.
Gemini 3.1 Pro is a strong third and the best value option for math-heavy workloads.
Math Leaderboard
Updated April 21, 2026. Sorted by math weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Org | Math score | Overall | AIME 2023 | AIME 2024 | AIME 2025 | AIME25 (Arcee) | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | 100% | Est. 89 | — | — | — | — | — | — | — | — | — |
| 2 | GPT-5.2-Codex | OpenAI | 97.7% | Est. 79 | — | — | — | — | — | — | — | — | — |
| 3 | GPT-5.1-Codex-Max | OpenAI | 97.2% | Est. 78 | — | — | — | — | — | — | — | — | — |
| 4 | Gemini 3 Pro Deep Think | Google | 96% | Est. 86 | — | — | — | — | — | — | — | — | — |
| 5 | Claude Opus 4.5 | Anthropic | 94.9% | 80 | — | — | — | — | — | — | — | — | — |
| 6 | | | 93.7% | Est. 72 | — | — | — | — | — | — | — | — | — |
| 7 | | | 93% | Est. 84 | — | — | — | — | — | — | — | — | — |
| 8 | GPT-5.4 | OpenAI | 92.8% | 93 | — | — | — | — | — | — | — | — | — |
| 9 | Qwen3.5 397B (Reasoning) | Alibaba | 92.3% | Est. 80 | — | — | — | — | — | — | — | — | — |
| 10 | Grok 4.1 | xAI | 91.9% | Est. 80 | — | — | — | — | — | — | — | — | — |
| 11 | GPT-5 (medium) | OpenAI | 91.7% | Est. 73 | — | — | — | — | — | — | — | — | — |
| 12 | GLM-5.1 | Z.AI | 90.4% | 84 | — | — | — | — | — | — | — | — | — |
| 13 | Sarvam 105B | Sarvam | 90.4% | Est. 41 | — | — | — | — | — | — | — | — | — |
| 14 | Claude Opus 4.6 | Anthropic | 89.4% | 91 | — | — | — | 99.8% | — | — | — | — | — |
| 15 | GLM-5 | Z.AI | 87.7% | 77 | — | — | — | 93.3% | — | — | — | — | — |
| 16 | Claude Sonnet 4.5 | Anthropic | 87.7% | Est. 67 | — | — | 87% | — | — | — | — | — | — |
| 17 | o3-pro | OpenAI | 86.4% | Est. 59 | — | — | — | — | — | — | — | — | — |
| 18 | GPT-5.2 | OpenAI | 83.7% | 83 | — | — | — | — | — | — | — | — | — |
| 19 | o3 | OpenAI | 83.4% | Est. 59 | — | — | — | — | — | — | — | — | — |
| 20 | Gemini 3 Pro | Google | 83% | Est. 83 | — | — | — | — | — | — | — | — | — |
| 21 | o1-preview | OpenAI | 82.7% | Est. 68 | — | — | — | — | — | — | — | — | — |
| 22 | MiMo-V2-Flash | Xiaomi | 82.1% | Est. 62 | — | — | 94.1% | — | — | — | — | — | — |
| 23 | Sarvam 30B | Sarvam | 81.2% | Est. 42 | — | — | — | — | — | — | — | — | — |
| 24 | Grok 4 | xAI | 80% | Est. 67 | — | — | — | — | — | — | — | — | — |
| 25 | GLM-4.7 | Z.AI | 79.8% | Est. 71 | — | — | 95.7% | — | — | — | — | — | — |
Score in Context
What these scores mean
Math carries a 5% weight in overall scoring, kept relatively low because frontier models have saturated the main competition benchmarks: older AIME and HMMT scores sit at 95-99% across top models. The weighted score rests on AIME 2025, BRUMO 2025, and MATH-500, which still show meaningful separation.
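As a rough illustration, assuming the overall score is a simple weighted average of category scores (the exact formula lives on the BenchLM methodology page), the arithmetic looks like this; the category score below is invented:

```python
# Illustration only: an invented score, and a simple weighted-average
# aggregation that may differ from BenchLM's exact formula.
MATH_CATEGORY_WEIGHT = 0.05  # math's 5% share of the overall score

math_category_score = 92.8   # hypothetical math weighted score (0-100)
contribution = MATH_CATEGORY_WEIGHT * math_category_score

print(f"Math contributes {contribution:.2f} points to the overall score")
# Math contributes 4.64 points to the overall score
```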
Known limitations
Older AIME and HMMT editions are effectively solved by AI; they are displayed for reference but no longer factor into the weighted score. If math reasoning is critical for your use case, look at BRUMO 2025 scores specifically, and consider models with explicit reasoning capabilities (chain-of-thought). See the AIME & HMMT explainer.
How we weight
Mathematics carries a 5% weight in BenchLM.ai's overall scoring. Within the category, AIME 2025, BRUMO 2025, and MATH-500 are weighted; the older AIME and HMMT editions, where frontier models score 95-99%, are displayed for reference only due to saturation. If mathematical reasoning is critical, prioritize models with explicit reasoning capabilities. See the math leaderboard or read the AIME & HMMT explainer.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
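A minimal sketch of that filtering rule, assuming each benchmark row carries a provenance tag. The tag names, the row shape, and the example data are illustrative assumptions, not BenchLM's actual schema:

```python
# Hypothetical provenance tags: "derived" = generated from other scores,
# "cloned" = copied from a reference model. Not BenchLM's real schema.
SYNTHETIC = {"derived", "cloned"}

def trustworthy_rows(rows: list[dict]) -> list[dict]:
    """Keep only sourced public benchmark rows; excluded weighted
    benchmarks stay missing instead of being backfilled."""
    return [r for r in rows if r.get("provenance") not in SYNTHETIC]

rows = [
    {"benchmark": "AIME 2025", "score": 94.1, "provenance": "public"},
    {"benchmark": "BRUMO 2025", "score": 90.0, "provenance": "derived"},
]
print(trustworthy_rows(rows))  # only the AIME 2025 row survives
```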
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| AIME 2023 | — | Display only | High school mathematics competition |
| AIME 2024 | — | Display only | High school mathematics competition |
| AIME 2025 | 25% | Weighted | High school mathematics competition |
| AIME25 (Arcee) | — | Display only | Display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart. |
| HMMT Feb 2023 | — | Display only | Collegiate mathematics competition |
| HMMT Feb 2024 | — | Display only | Collegiate mathematics competition |
| HMMT Feb 2025 | — | Display only | Collegiate mathematics competition |
| BRUMO 2025 | 25% | Weighted | University-level mathematics olympiad |
| MATH-500 | 15% | Weighted | Curated 500-problem subset of the MATH dataset covering algebra, geometry, number theory, and more |
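Continuing the sketch above with the weights from this table, one way the fallback could work is to renormalize the remaining weights when a weighted benchmark has no trustworthy row, rather than imputing the gap; the scores here are invented:

```python
# Weights from the table above; example scores are invented, and BRUMO
# is deliberately missing to show the renormalization fallback.
MATH_WEIGHTS = {"AIME 2025": 0.25, "BRUMO 2025": 0.25, "MATH-500": 0.15}

scores = {"AIME 2025": 94.1, "MATH-500": 88.0}  # no trusted BRUMO row

present = {b: w for b, w in MATH_WEIGHTS.items() if b in scores}
category = sum(scores[b] * w for b, w in present.items()) / sum(present.values())

print(f"Math weighted score: {category:.1f}")  # Math weighted score: 91.8
```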