
AIME & HMMT: Can AI Models Do Competition Math?

AIME and HMMT are high school math olympiad competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.

Glevd · March 7, 2026 · 10 min read

The American Invitational Mathematics Examination (AIME) and Harvard-MIT Mathematics Tournament (HMMT) are prestigious math competitions designed for the most talented high school students. They've become standard AI benchmarks — and the results are striking.

Frontier models now score 95-99% on these competitions. Competition-level math is, for practical purposes, solved by AI.

AIME: What it tests

AIME is a 15-question, 3-hour examination. Each answer is an integer from 000 to 999. The problems require creative mathematical insight across algebra, geometry, number theory, and combinatorics.

In human competition, qualifying for AIME (scoring well on AMC 10/12) puts a student roughly in the top 5% nationally. Scoring well on AIME itself puts them in contention for the USA Mathematical Olympiad (USAMO). A perfect score on AIME is exceptionally rare — in most years, only a handful of students achieve it.

What makes AIME challenging is that problems rarely require advanced mathematical knowledge. Instead, they demand creative problem-solving: seeing non-obvious connections, applying techniques in novel ways, and chaining long multi-step deductions under time pressure (AIME requires only a final integer answer, not a written proof). This is precisely why AIME became popular as an AI benchmark — it tests genuine mathematical reasoning rather than rote knowledge.

We track three years: AIME 2023, AIME 2024, and AIME 2025. Tracking multiple years helps detect whether models have memorized specific problem sets or possess generalizable math ability.

HMMT: What it tests

HMMT is hosted jointly by Harvard and MIT and is one of the most competitive high school math tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory at a difficulty level comparable to or exceeding AIME.

Unlike AIME, HMMT includes team rounds that require collaborative problem-solving. For AI benchmarking purposes, we use the individual round problems. HMMT problems tend to be slightly harder than AIME on average, with more emphasis on proof-like reasoning and multi-step deductions.

We track: HMMT 2023, HMMT 2024, HMMT 2025.

Current scores

The numbers speak for themselves:

Model              AIME 2025   HMMT 2025
GPT-5.4                98          96
GPT-5.3 Codex          98          96
Claude Opus 4.6        97          96
DeepSeek R1            96          90

The top models are all above 95 on AIME 2025 and above 90 on HMMT 2025. The gaps between models are just 1-2 points — within noise range.
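To see why 1–2 point gaps are within noise, it helps to look at the binomial standard error of a benchmark score. The sketch below is a simplification (it treats questions as independent and equally difficult, and the function name is ours, not BenchLM.ai's), but it makes the point: on a 15-question exam, a single run of a ~97%-accurate model has a standard error of roughly 4 points, far larger than the gaps in the table above.

```python
import math

def score_std_error(accuracy_pct: float, n_questions: int, n_runs: int = 1) -> float:
    """Binomial standard error of a benchmark score, in percentage points.

    Treats each question as an independent Bernoulli trial; averaging over
    n_runs repeated samples shrinks the error by sqrt(n_runs).
    """
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / (n_questions * n_runs))

# A 97% score on a single 15-question AIME run: ~4.4 points of noise.
single_run = score_std_error(97, 15)

# Averaging over 8 runs shrinks that to ~1.6 points -- still on the
# order of the gaps between frontier models.
averaged = score_std_error(97, 15, n_runs=8)
```

This is why leaderboards typically average over many sampled runs per model; even then, a 97 vs. 98 difference is not statistically meaningful on a 15-question exam.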

The path to saturation

Competition math benchmarks followed a predictable arc. In 2023, the best models scored around 50-60% on AIME — impressive, but far from human expert performance. By mid-2024, reasoning-enhanced models like o1 and Claude 3.5 pushed scores into the 80s. By early 2025, scores crossed 90%, and by 2026, the benchmark is effectively saturated.

This rapid progression illustrates a broader pattern in AI benchmarks: once models develop the right reasoning capabilities, scores compress quickly. The 50-to-95 jump happened in roughly 18 months.

Year-over-year contamination risk

One concern with tracking the same competition across years is that older problems may appear in training data. A model that scores 98 on AIME 2023 but only 85 on AIME 2025 might have memorized older problems. By tracking three consecutive years, BenchLM.ai lets you spot this pattern. In practice, frontier models score consistently across all three years, suggesting genuine mathematical ability rather than memorization.
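The memorization check described above can be sketched in a few lines. This is our own illustrative heuristic, not BenchLM.ai's actual methodology: flag a model whose score on any older (likely in-training-data) year exceeds its score on the newest year by more than some threshold.

```python
def contamination_gap(scores_by_year: dict[int, float], threshold: float = 5.0) -> bool:
    """Flag a possible memorization pattern: markedly higher scores on
    older problem sets than on the newest (least-contaminated) one."""
    newest = max(scores_by_year)
    return any(
        scores_by_year[year] - scores_by_year[newest] > threshold
        for year in scores_by_year
        if year != newest
    )

contamination_gap({2023: 98, 2024: 97, 2025: 85})  # True: old years much higher
contamination_gap({2023: 97, 2024: 98, 2025: 96})  # False: consistent across years
```

The threshold is a judgment call; a stricter analysis would account for year-to-year difficulty differences and sampling noise before attributing a gap to memorization.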

What "solved" means

When we say competition math is "solved," we mean that AI models can reliably answer these problems at or above the level of the best human competitors. The 1-2 point differences between models aren't meaningful — they likely reflect minor variations in prompting and sampling rather than genuine capability differences.

However, "solved" has important caveats:

  • Formatting matters: Models sometimes produce correct mathematical reasoning but format the final integer answer incorrectly. A score of 97 vs 98 may reflect formatting issues, not reasoning gaps.
  • Chain-of-thought helps enormously: Without chain-of-thought prompting, even frontier models score 10-20 points lower on AIME. The reasoning process itself is critical.
  • Verification is different from generation: Models that score 98 on AIME can solve the problems, but they can't always explain why their approach works at the level a human mathematician would.

Benchmarks that still differentiate

For benchmarks that still show meaningful separation between models on mathematical reasoning, look at:

  • BRUMO 2025 — Brown University Math Olympiad, with slightly more separation between frontier models
  • MATH-500 — Standard benchmark covering a broader range of difficulty levels, also nearing saturation at the top but with more variance in mid-tier models

The most informative math benchmark for comparing frontier models is arguably MATH-500, where the broader difficulty range means scores aren't all clustered at 95+.

How to use math benchmarks when choosing an LLM

If your use case involves mathematical reasoning — whether for education, research, or engineering — here's what the data tells you:

  1. Any frontier model works for standard math: If you need algebra, calculus, or statistics, every model scoring 90+ on AIME will handle it well.
  2. Check MATH-500 for harder problems: For graduate-level mathematics or novel proofs, MATH-500 scores show more differentiation.
  3. Don't pick a model based on AIME alone: A 97 vs 98 on AIME tells you nothing meaningful. Look at coding benchmarks or reasoning benchmarks where gaps are wider.
  4. Reasoning models have an edge: Models with explicit chain-of-thought capabilities (o3, o4-mini, DeepSeek R1) tend to be more reliable on multi-step math than their non-reasoning counterparts.

The bottom line

Competition math benchmarks were immensely valuable during the 2023-2024 period when they could differentiate between models. In 2026, they've served their purpose — the ceiling has been reached. The frontier of AI mathematical evaluation has moved to harder problems, broader coverage, and more nuanced reasoning tasks.

See all math scores on our math rankings page, or compare specific models on our comparison pages.


Data from BenchLM.ai. Last updated March 2026.
