AIME and HMMT are high school math competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.
Frontier AI models now score 95-99% on AIME and HMMT — competition math is effectively solved. The top 5 models are within 2 points of each other on both benchmarks. For comparing frontier models on math in 2026, BRUMO and MATH-500 provide more signal. AIME and HMMT remain useful as display benchmarks and floor checks for mid-tier models, but BenchLM.ai no longer weights them into the math score.
The American Invitational Mathematics Examination (AIME) and Harvard-MIT Mathematics Tournament (HMMT) are prestigious math competitions designed for the most talented high school students. They've become standard AI benchmarks — and the results are striking.
Frontier models now score 95-99% on these competitions. Competition-level math is, for practical purposes, solved by AI.
AIME is a 15-question, 3-hour examination. Each answer is an integer from 000 to 999. The problems require creative mathematical insight across algebra, geometry, number theory, and combinatorics.
In human competition, qualifying for AIME puts a student in the top ~5% nationally. A perfect score is exceptionally rare — in most years, only a handful of students achieve it.
What makes AIME challenging is that problems rarely require advanced mathematical knowledge. Instead, they demand creative problem-solving: seeing non-obvious connections, applying techniques in novel ways, and constructing multi-step solutions. This is precisely why AIME became popular as an AI benchmark — it tests genuine mathematical reasoning.
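In practice, benchmarking a model on an AIME exam comes down to checking 15 integer answers. Below is a minimal grading sketch in Python, assuming exact-match scoring on the final integer; this is an assumption about a typical evaluation harness, not BenchLM.ai's documented pipeline.

```python
# Minimal sketch of AIME-style grading: 15 problems, each answer an integer
# in [0, 999], scored by exact match on the final answer.
# Assumed harness for illustration, not BenchLM.ai's documented pipeline.

def grade_aime(predictions: list[int], answer_key: list[int]) -> float:
    """Return the percentage of problems answered correctly."""
    if len(predictions) != 15 or len(answer_key) != 15:
        raise ValueError("an AIME exam has exactly 15 problems")
    if not all(0 <= a <= 999 for a in answer_key):
        raise ValueError("AIME answers are integers from 000 to 999")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return 100 * correct / len(answer_key)

# With only 15 problems, each question is worth ~6.7 percentage points:
# 14/15 correct is ~93.3%, and a perfect exam is 100%.
```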
We track three years: AIME 2023, AIME 2024, and AIME 2025. Tracking multiple years helps detect whether models memorized specific problem sets or have generalizable math ability.
HMMT is hosted jointly by Harvard and MIT and is one of the most competitive high school math tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory at a difficulty comparable to or exceeding AIME, with more emphasis on proof-like reasoning and multi-step deductions.
We track: HMMT 2023, HMMT 2024, HMMT 2025.
| Model | AIME 2025 (%) | HMMT 2025 (%) |
|---|---|---|
| GPT-5.4 | 98 | 96 |
| GPT-5.3 Codex | 98 | 96 |
| Claude Opus 4.6 | 97 | 96 |
| DeepSeek R1 | 96 | 90 |
The top models are all above 95 on AIME 2025 and above 90 on HMMT 2025. The gaps between models are just 1-2 points — within noise range.
Competition math benchmarks followed a predictable arc. In 2023, the best models scored around 50-60% on AIME. By mid-2024, reasoning-enhanced models pushed scores into the 80s. By early 2025, scores crossed 90%, and by 2026, the benchmark is effectively saturated.
This rapid progression illustrates a broader pattern: once models develop the right reasoning capabilities, scores compress quickly. The 50-to-95 jump happened in roughly 18 months.
A model that scores 98 on AIME 2023 but only 85 on AIME 2025 might have memorized older problems. By tracking three consecutive years, BenchLM.ai lets you spot this pattern. In practice, frontier models score consistently across all three years, suggesting genuine mathematical ability.
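As an illustration of that check, here is a small Python sketch with made-up scores; the 10-point drop threshold is an arbitrary assumption for the example, not BenchLM.ai's methodology.

```python
# Hypothetical cross-year consistency check: a model that scores much higher
# on older, widely circulated exams than on the newest one may have memorized
# specific problem sets rather than learned generalizable math ability.

AIME_SCORES = {  # illustrative numbers only
    "model-a": {2023: 98, 2024: 97, 2025: 96},  # consistent across years
    "model-b": {2023: 98, 2024: 94, 2025: 85},  # large drop on the newest year
}

DROP_THRESHOLD = 10  # arbitrary cutoff in percentage points


def flag_possible_memorization(scores_by_year: dict[int, float]) -> bool:
    """Flag a model whose newest-year score trails its best older-year score."""
    newest_year = max(scores_by_year)
    older_best = max(v for y, v in scores_by_year.items() if y != newest_year)
    return older_best - scores_by_year[newest_year] >= DROP_THRESHOLD


for model, scores in AIME_SCORES.items():
    status = "possible memorization" if flag_possible_memorization(scores) else "consistent"
    print(f"{model}: {status}")
```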
When we say competition math is "solved," we mean AI models can reliably answer these problems at or above the level of the best human competitors. The 1-2 point differences between frontier models aren't meaningful.
However, some caveats apply:
- "Solved" means solved at the frontier. Mid-tier models still show real spread, which is why AIME and HMMT remain useful as floor checks.
- Competition problems are public once the exams run, so memorization is always a possibility; tracking three consecutive years is the mitigation.
- These are short-answer competition problems, not open-ended research math. On harder evaluations such as HLE's math questions, top models still score only 10-46%.
For meaningful separation between frontier models on math:
- BRUMO 2025
- MATH-500
- HLE (which includes hard math questions where top models score 10-46%)
→ See all math models ranked · Full leaderboard
Competition math benchmarks served their purpose during 2023-2024 when they could differentiate models. In 2026, AIME and HMMT are floor checks, not differentiators. For comparing frontier models on math, check BRUMO 2025 and MATH-500.
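For intuition, here is a minimal sketch of how a weighted math ranking can treat saturated benchmarks as display-only; the weights, benchmark mix, and example scores are illustrative assumptions, not BenchLM.ai's actual formula.

```python
# Illustrative weighted math score that ignores display-only (saturated) benchmarks.
# Weights and scores are assumptions for this sketch, not BenchLM.ai's methodology.

WEIGHTS = {
    "BRUMO 2025": 0.4,
    "MATH-500": 0.4,
    "HLE (math)": 0.2,
    # Saturated benchmarks stay on the page but carry zero weight in the ranking.
    "AIME 2025": 0.0,
    "HMMT 2025": 0.0,
}


def weighted_math_score(benchmark_scores: dict[str, float]) -> float:
    """Weighted average over the benchmarks that still carry weight."""
    active = {b: w for b, w in WEIGHTS.items() if w > 0 and b in benchmark_scores}
    total_weight = sum(active.values())
    if total_weight == 0:
        raise ValueError("no weighted benchmarks present")
    return sum(benchmark_scores[b] * w for b, w in active.items()) / total_weight


# A hypothetical model: its AIME score is displayed but does not move the ranking.
print(weighted_math_score(
    {"AIME 2025": 98, "BRUMO 2025": 74, "MATH-500": 91, "HLE (math)": 32}
))  # -> 72.4
```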
What is AIME and how is it used to benchmark AI? AIME is a 15-question US high school math competition requiring creative mathematical reasoning rather than rote knowledge. AI models are tested on it because the creative problem-solving that makes it hard for students is also what made it a useful discriminator between models. Frontier models now score 95-99%.
What do AI models score on AIME 2025? GPT-5.4 and GPT-5.3 Codex both score 98, Claude Opus 4.6 scores 97, DeepSeek R1 scores 96. The top models are all above 95 — competition math is effectively solved at the frontier. See the AIME 2025 leaderboard.
Is AIME still a useful AI benchmark in 2026? Largely saturated for frontier comparison — top 5 models are within 2 points. BRUMO 2025 and MATH-500 show more separation. AIME 2025 remains useful as a display benchmark and floor check for mid-tier models, but it no longer affects BenchLM.ai's weighted math ranking.
What is HMMT and how does it compare to AIME? HMMT (Harvard-MIT Mathematics Tournament) tests problems comparable to or harder than AIME, with more proof-like reasoning. Frontier models score 90-96 on HMMT 2025, slightly lower than AIME 2025. See HMMT 2025 leaderboard.
What math benchmarks still differentiate frontier models in 2026? BRUMO 2025, MATH-500, and HLE (includes hard math questions where top models score 10-46%). All AIME and HMMT variants are now display-only on BenchLM.ai because they are saturated at the frontier.
Data from BenchLM.ai. Last updated March 2026.