AIME and HMMT are elite high school math competitions now used to benchmark AI. Frontier models score 95-99%: competition math is effectively solved. Here's what that means.
The American Invitational Mathematics Examination (AIME) and Harvard-MIT Mathematics Tournament (HMMT) are prestigious math competitions designed for the most talented high school students. They've become standard AI benchmarks — and the results are striking.
Frontier models now score 95-99% on these competitions. Competition-level math is, for practical purposes, solved by AI.
AIME is a 15-question, 3-hour examination. Each answer is an integer from 000 to 999. The problems require creative mathematical insight across algebra, geometry, number theory, and combinatorics.
In human competition, qualifying for AIME (scoring well on AMC 10/12) puts a student in the top ~5% nationally. Scoring well on AIME itself puts them in contention for the US Math Olympiad. A perfect score on AIME is exceptionally rare — in most years, fewer than a handful of students achieve it.
What makes AIME challenging is that problems rarely require advanced mathematical knowledge. Instead, they demand creative problem-solving: seeing non-obvious connections, applying techniques in novel ways, and chaining multi-step deductions under time pressure. This is precisely why AIME became popular as an AI benchmark: it tests genuine mathematical reasoning rather than rote knowledge.
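Because every answer is a bare integer from 000 to 999, grading is also unambiguous and easy to automate. Here is a minimal sketch of how a benchmark harness might score an AIME-style attempt, assuming exact-match grading with one point per question; the function names and example answers are hypothetical, not BenchLM.ai's actual pipeline.

```python
def parse_aime_answer(text: str) -> int | None:
    """Extract an AIME-style answer: an integer in 0-999, often written as three digits."""
    cleaned = text.strip()
    if not cleaned.isdigit():   # rejects negatives, fractions, stray text
        return None
    value = int(cleaned)        # int() tolerates leading zeros like "070"
    return value if value <= 999 else None


def grade_attempt(predictions: list[str], answer_key: list[int]) -> float:
    """Score one 15-question attempt: exact match only, no partial credit."""
    correct = sum(
        parse_aime_answer(pred) == answer
        for pred, answer in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical 3-question excerpt: "540" misses the key of 539.
print(grade_attempt(["070", "540", "25"], [70, 539, 25]))  # ~0.667
```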
We track three years: AIME 2023, AIME 2024, and AIME 2025. Tracking multiple years helps detect whether models have memorized specific problem sets or possess generalizable math ability.
HMMT is hosted jointly by Harvard and MIT and is one of the most competitive high school math tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory at a difficulty level comparable to or exceeding AIME.
Unlike AIME, HMMT includes team rounds that require collaborative problem-solving. For AI benchmarking purposes, we use the individual round problems. HMMT problems tend to be slightly harder than AIME on average, with more emphasis on proof-like reasoning and multi-step deductions.
As with AIME, we track three years: HMMT 2023, HMMT 2024, and HMMT 2025.
The numbers speak for themselves:
| Model | AIME 2025 (% correct) | HMMT 2025 (% correct) |
|---|---|---|
| GPT-5.4 | 98 | 96 |
| GPT-5.3 Codex | 98 | 96 |
| Claude Opus 4.6 | 97 | 96 |
| DeepSeek R1 | 96 | 90 |
The top models are all above 95 on AIME 2025 and above 90 on HMMT 2025. The gaps between models are just 1-2 points — within noise range.
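To see why 1-2 point gaps count as noise, treat each score as a binomial proportion over the problem set. The sketch below estimates the standard error under two simplifying assumptions: a year's set has roughly 30 problems (AIME I and II combined), and each problem is an independent pass/fail trial. The numbers are illustrative, not BenchLM.ai's methodology.

```python
import math

def score_std_error(score_pct: float, n_problems: int) -> float:
    """Standard error, in percentage points, of an exam score modeled as a
    binomial proportion over n independent problems."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_problems)

# Assumed: ~30 problems per year (AIME I + II combined), single run.
for score in (98, 97, 96):
    print(f"{score}%: +/- {score_std_error(score, 30):.1f} points (one standard error)")
# 98%: +/- 2.6 points
# 97%: +/- 3.1 points
# 96%: +/- 3.6 points
```

With error bars of roughly 3 points per model on a single run, a 1-2 point gap between two models is not statistically distinguishable.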
Competition math benchmarks followed a predictable arc. In 2023, the best models scored around 50-60% on AIME: impressive, but far from human expert performance. By late 2024, reasoning-enhanced models such as OpenAI's o1 pushed scores into the 80s. By early 2025, scores crossed 90%, and by 2026, the benchmark is effectively saturated.
This rapid progression illustrates a broader pattern in AI benchmarks: once models develop the right reasoning capabilities, scores compress quickly. The 50-to-95 jump happened in roughly 18 months.
One concern with tracking the same competition across years is that older problems may appear in training data. A model that scores 98 on AIME 2023 but only 85 on AIME 2025 might have memorized older problems. By tracking three consecutive years, BenchLM.ai lets you spot this pattern. In practice, frontier models score consistently across all three years, suggesting genuine mathematical ability rather than memorization.
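A simple way to operationalize this check is to compare a model's scores on older years (which plausibly appear in training data) against its score on the most recent year. The sketch below flags a suspicious gap; the threshold, function name, and example scores are hypothetical.

```python
def memorization_flag(scores_by_year: dict[int, float], threshold: float = 5.0) -> bool:
    """Flag a model whose score on any older year exceeds its score on the
    most recent year by more than `threshold` percentage points."""
    latest_year = max(scores_by_year)
    latest_score = scores_by_year[latest_year]
    older_scores = [s for year, s in scores_by_year.items() if year != latest_year]
    return any(s - latest_score > threshold for s in older_scores)

# Hypothetical examples
print(memorization_flag({2023: 98, 2024: 97, 2025: 85}))  # True  -> possible contamination
print(memorization_flag({2023: 97, 2024: 98, 2025: 96}))  # False -> consistent across years
```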
When we say competition math is "solved," we mean that AI models can reliably answer these problems at or above the level of the best human competitors. The 1-2 point differences between models aren't meaningful — they likely reflect minor variations in prompting and sampling rather than genuine capability differences.
However, "solved" has important caveats:
For benchmarks that still show meaningful separation between models on mathematical reasoning, look at:
The most informative math benchmark for comparing frontier models is arguably MATH-500, where the broader difficulty range means scores aren't all clustered at 95+.
If your use case involves mathematical reasoning — whether for education, research, or engineering — here's what the data tells you:
Competition math benchmarks were immensely valuable during the 2023-2024 period when they could differentiate between models. In 2026, they've served their purpose — the ceiling has been reached. The frontier of AI mathematical evaluation has moved to harder problems, broader coverage, and more nuanced reasoning tasks.
See all math scores on our math rankings page, or compare specific models on our comparison pages.
Data from BenchLM.ai. Last updated March 2026.