
AIME & HMMT: Can AI Models Do Competition Math?

AIME and HMMT are high school math olympiad competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.

Glevd·Published March 7, 2026·10 min read


Frontier AI models now score 95-99% on AIME and HMMT — competition math is effectively solved. The top 5 models are within 2 points of each other on both benchmarks. For comparing frontier models on math in 2026, BRUMO and MATH-500 provide more signal. AIME and HMMT remain useful as display benchmarks and floor checks for mid-tier models, but BenchLM.ai no longer weights them into the math score.

The American Invitational Mathematics Examination (AIME) and Harvard-MIT Mathematics Tournament (HMMT) are prestigious math competitions designed for the most talented high school students. They've become standard AI benchmarks — and the results are striking.

Frontier models now score 95-99% on these competitions. Competition-level math is, for practical purposes, solved by AI.

AIME: What it tests

AIME is a 15-question, 3-hour examination. Each answer is an integer from 000 to 999. The problems require creative mathematical insight across algebra, geometry, number theory, and combinatorics.
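Because every AIME answer is an integer from 000 to 999, automated grading reduces to extracting the model's final integer and exact-matching it against the key. A minimal sketch of such a harness (the regex and function names are illustrative, not any particular evaluation framework's API):

```python
import re

def extract_aime_answer(completion: str):
    """Pull the last standalone 1-3 digit integer from a model completion.

    AIME answers are integers from 000 to 999, so anything outside that
    range (or a completion with no such integer) is a failed extraction.
    """
    matches = re.findall(r"\b(\d{1,3})\b", completion)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None

def grade(completion: str, key: int) -> bool:
    """Exact-match grading: correct iff the extracted integer equals the key."""
    return extract_aime_answer(completion) == key

print(grade("The area is 25/4, so the final answer is 204.", 204))  # True
print(grade("Answer: 073", 73))        # True: leading zeros normalize away
print(grade("The answer is 1024", 204))  # False: 4-digit number fails extraction
```

Taking the *last* matching integer is a deliberate choice: chain-of-thought completions mention many intermediate numbers, and the final answer conventionally comes last.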

In human competition, qualifying for AIME puts a student roughly in the top 5% nationally. A perfect score is exceptionally rare: in most years, only a handful of students achieve one.

What makes AIME challenging is that problems rarely require advanced mathematical knowledge. Instead, they demand creative problem-solving: seeing non-obvious connections, applying techniques in novel ways, and constructing multi-step solutions (AIME answers are integers, so no written proofs are required). This is precisely why AIME became popular as an AI benchmark — it tests genuine mathematical reasoning rather than recall.

We track three years: AIME 2023, AIME 2024, and AIME 2025. Tracking multiple years helps detect whether models memorized specific problem sets or have generalizable math ability.

HMMT: What it tests

HMMT is hosted jointly by Harvard and MIT and is one of the most competitive high school math tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory at a difficulty comparable to or exceeding AIME, with more emphasis on proof-like reasoning and multi-step deductions.

We track: HMMT 2023, HMMT 2024, HMMT 2025.

Current scores

Model              AIME 2025   HMMT 2025
GPT-5.4                98          96
GPT-5.3 Codex          98          96
Claude Opus 4.6        97          96
DeepSeek R1            96          90

The top models are all above 95 on AIME 2025 and above 90 on HMMT 2025. The gaps between them are just 1-2 points — within the sampling noise of a 15-question exam.
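The noise claim can be made concrete. If a reported score is accuracy over the 15 AIME problems (possibly averaged across several sampled runs — the exact protocol is an assumption here), the binomial standard error at ~97% accuracy exceeds the observed gaps between top models. A back-of-envelope sketch:

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate, in percentage points,
    treating each of the n graded answers as an independent Bernoulli trial."""
    return 100 * math.sqrt(p * (1 - p) / n)

# A single 15-question AIME sitting at 97% accuracy:
print(round(score_stderr(0.97, 15), 1))   # ~4.4 points

# Even averaging over, say, 8 sampled runs (n = 120 graded answers):
print(round(score_stderr(0.97, 120), 1))  # ~1.6 points
```

On one sitting, a single problem is worth ~6.7 points, so a 1-2 point gap between two models cannot even correspond to a whole question — it only arises from averaging, and it sits well inside one standard error.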

The path to saturation

Competition math benchmarks followed a predictable arc. In 2023, the best models scored around 50-60% on AIME. By mid-2024, reasoning-enhanced models pushed scores into the 80s. By early 2025, scores crossed 90%, and by 2026, the benchmark is effectively saturated.

This rapid progression illustrates a broader pattern: once models develop the right reasoning capabilities, scores compress quickly. The 50-to-95 jump happened in roughly 18 months.

Year-over-year contamination risk

A model that scores 98 on AIME 2023 but only 85 on AIME 2025 might have memorized older problems. By tracking three consecutive years, BenchLM.ai lets you spot this pattern. In practice, frontier models score consistently across all three years, suggesting genuine mathematical ability.
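The year-over-year check described above is easy to automate: flag any model whose score on an older exam exceeds its newest-year score by more than a chosen threshold. A minimal sketch (model names and the 5-point threshold are illustrative, not BenchLM.ai's actual criterion):

```python
def flag_contamination(scores: dict[str, dict[int, float]], threshold: float = 5.0):
    """Flag models whose older-year scores beat their newest-year score.

    `scores` maps model name -> {exam year: score}. A large drop on the
    most recent exam (the one least likely to be in training data)
    suggests the older problem sets were memorized.
    """
    flagged = []
    for model, by_year in scores.items():
        newest = by_year[max(by_year)]          # score on the latest exam
        worst_gap = max(s - newest for s in by_year.values())
        if worst_gap > threshold:
            flagged.append((model, worst_gap))
    return flagged

scores = {
    "model-a": {2023: 97, 2024: 96, 2025: 98},  # consistent: fine
    "model-b": {2023: 98, 2024: 92, 2025: 85},  # big drop: suspicious
}
print(flag_contamination(scores))  # [('model-b', 13)]
```

Consistency across years is necessary but not sufficient evidence of genuine ability — a model could have memorized all three sets — which is why fresh exams each spring remain the strongest check.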

What "solved" means

When we say competition math is "solved," we mean AI models can reliably answer these problems at or above the level of the best human competitors. The 1-2 point differences between frontier models aren't meaningful.

However, some caveats apply:

  • Chain-of-thought helps enormously: Without it, even frontier models score 10-20 points lower on AIME. The reasoning process itself is critical.
  • Formatting matters: Models sometimes produce correct reasoning but format the final integer answer incorrectly.
  • Verification is different from generation: Models scoring 98 can solve problems but can't always explain why their approach works at the level a human mathematician would.

Benchmarks that still differentiate

For meaningful separation between frontier models on math:

  • BRUMO 2025 — Brown University Math Olympiad, slightly more separation between frontier models
  • MATH-500 — Broader difficulty range, more variance in mid-tier models
  • HLE — Includes advanced math at frontier difficulty (top models score 10-46%)

See all math models ranked · Full leaderboard

The bottom line

Competition math benchmarks served their purpose during 2023-2024 when they could differentiate models. In 2026, AIME and HMMT are floor checks, not differentiators. For comparing frontier models on math, check BRUMO 2025 and MATH-500.


Frequently asked questions

What is AIME and how is it used to benchmark AI? AIME is a 15-question US high school math competition requiring creative mathematical reasoning, not rote knowledge. AI models are tested on it because its problems resist recall, which made it a strong discriminator while models were still below the ceiling. Frontier models now score 95-99%.

What do AI models score on AIME 2025? GPT-5.4 and GPT-5.3 Codex both score 98, Claude Opus 4.6 scores 97, DeepSeek R1 scores 96. The top models are all above 95 — competition math is effectively solved at the frontier. See the AIME 2025 leaderboard.

Is AIME still a useful AI benchmark in 2026? Largely saturated for frontier comparison — top 5 models are within 2 points. BRUMO 2025 and MATH-500 show more separation. AIME 2025 remains useful as a display benchmark and floor check for mid-tier models, but it no longer affects BenchLM.ai's weighted math ranking.

What is HMMT and how does it compare to AIME? HMMT (Harvard-MIT Mathematics Tournament) tests problems comparable to or harder than AIME, with more proof-like reasoning. Frontier models score 90-96 on HMMT 2025, slightly lower than AIME 2025. See HMMT 2025 leaderboard.

What math benchmarks still differentiate frontier models in 2026? BRUMO 2025, MATH-500, and HLE (includes hard math questions where top models score 10-46%). All AIME and HMMT variants are now display-only on BenchLM.ai because they are saturated at the frontier.


Data from BenchLM.ai. Last updated March 2026.
