
AIME & HMMT: Can AI Models Do Competition Math?

AIME and HMMT are high school math olympiad competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.

Glevd · March 7, 2026 · 10 min read

The American Invitational Mathematics Examination (AIME) and Harvard-MIT Mathematics Tournament (HMMT) are prestigious math competitions designed for the most talented high school students. They've become standard AI benchmarks — and the results are striking.

Frontier models now score 95-99% on these competitions. Competition-level math is, for practical purposes, solved by AI.

AIME: What it tests

AIME is a 15-question, 3-hour examination. Each answer is an integer from 000 to 999. The problems require creative mathematical insight across algebra, geometry, number theory, and combinatorics.

In human competition, qualifying for AIME (scoring well on AMC 10/12) puts a student roughly in the top 5% nationally. Scoring well on AIME itself puts them in contention for the USA Mathematical Olympiad (USAMO). A perfect score on AIME is exceptionally rare — in most years, only a handful of students achieve it.

What makes AIME challenging is that problems rarely require advanced mathematical knowledge. Instead, they demand creative problem-solving: seeing non-obvious connections, applying techniques in novel ways, and chaining long multi-step deductions under time pressure (AIME requires only a final integer answer, not a written proof). This is precisely why AIME became popular as an AI benchmark — it tests genuine mathematical reasoning rather than rote knowledge.

We track three years: AIME 2023, AIME 2024, and AIME 2025. Tracking multiple years helps detect whether models have memorized specific problem sets or possess generalizable math ability.

HMMT: What it tests

HMMT is hosted jointly by Harvard and MIT and is one of the most competitive high school math tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory at a difficulty level comparable to or exceeding AIME.

Unlike AIME, HMMT includes team rounds that require collaborative problem-solving. For AI benchmarking purposes, we use the individual round problems. HMMT problems tend to be slightly harder than AIME on average, with more emphasis on proof-like reasoning and multi-step deductions.

We track: HMMT 2023, HMMT 2024, HMMT 2025.

Current scores

The numbers speak for themselves:

Model              AIME 2025   HMMT 2025
GPT-5.4                98          96
GPT-5.3 Codex          98          96
Claude Opus 4.6        97          96
DeepSeek R1            96          90

The top models are all above 95 on AIME 2025 and above 90 on HMMT 2025. The gaps between models are just 1-2 points — within noise range.
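To see why 1–2 point gaps are within noise, it helps to look at the binomial standard error of a benchmark score. The sketch below is a simplification (it treats questions as independent and equally difficult, and the function name is ours, not BenchLM.ai's), but it makes the point: on a 15-question exam, a single run of a ~97%-accurate model has a standard error of roughly 4 points, far larger than the gaps in the table above.

```python
import math

def score_std_error(accuracy_pct: float, n_questions: int, n_runs: int = 1) -> float:
    """Binomial standard error of a benchmark score, in percentage points.

    Treats each question as an independent Bernoulli trial; averaging over
    n_runs repeated samples shrinks the error by sqrt(n_runs).
    """
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / (n_questions * n_runs))

# A 97% score on a single 15-question AIME run: ~4.4 points of noise.
single_run = score_std_error(97, 15)

# Averaging over 8 runs shrinks that to ~1.6 points -- still on the
# order of the gaps between frontier models.
averaged = score_std_error(97, 15, n_runs=8)
```

This is why leaderboards typically average over many sampled runs per model; even then, a 97 vs. 98 difference is not statistically meaningful on a 15-question exam.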

The path to saturation

Competition math benchmarks followed a predictable arc. In 2023, the best models scored around 50-60% on AIME — impressive, but far from human expert performance. By mid-2024, reasoning-enhanced models like o1 and Claude 3.5 pushed scores into the 80s. By early 2025, scores crossed 90%, and by 2026, the benchmark is effectively saturated.

This rapid progression illustrates a broader pattern in AI benchmarks: once models develop the right reasoning capabilities, scores compress quickly. The 50-to-95 jump happened in roughly 18 months.

Year-over-year contamination risk

One concern with tracking the same competition across years is that older problems may appear in training data. A model that scores 98 on AIME 2023 but only 85 on AIME 2025 might have memorized older problems. By tracking three consecutive years, BenchLM.ai lets you spot this pattern. In practice, frontier models score consistently across all three years, suggesting genuine mathematical ability rather than memorization.
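The memorization check described above can be sketched in a few lines. This is our own illustrative heuristic, not BenchLM.ai's actual methodology: flag a model whose score on any older (likely in-training-data) year exceeds its score on the newest year by more than some threshold.

```python
def contamination_gap(scores_by_year: dict[int, float], threshold: float = 5.0) -> bool:
    """Flag a possible memorization pattern: markedly higher scores on
    older problem sets than on the newest (least-contaminated) one."""
    newest = max(scores_by_year)
    return any(
        scores_by_year[year] - scores_by_year[newest] > threshold
        for year in scores_by_year
        if year != newest
    )

contamination_gap({2023: 98, 2024: 97, 2025: 85})  # True: old years much higher
contamination_gap({2023: 97, 2024: 98, 2025: 96})  # False: consistent across years
```

The threshold is a judgment call; a stricter analysis would account for year-to-year difficulty differences and sampling noise before attributing a gap to memorization.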

What "solved" means

When we say competition math is "solved," we mean that AI models can reliably answer these problems at or above the level of the best human competitors. The 1-2 point differences between models aren't meaningful — they likely reflect minor variations in prompting and sampling rather than genuine capability differences.

However, "solved" has important caveats:

  • Formatting matters: Models sometimes produce correct mathematical reasoning but format the final integer answer incorrectly. A score of 97 vs 98 may reflect formatting issues, not reasoning gaps.
  • Chain-of-thought helps enormously: Without chain-of-thought prompting, even frontier models score 10-20 points lower on AIME. The reasoning process itself is critical.
  • Verification is different from generation: Models that score 98 on AIME can solve the problems, but they can't always explain why their approach works at the level a human mathematician would.

Benchmarks that still differentiate

For benchmarks that still show meaningful separation between models on mathematical reasoning, look at:

  • BRUMO 2025 — Brown University Math Olympiad, with slightly more separation between frontier models
  • MATH-500 — Standard benchmark covering a broader range of difficulty levels, also nearing saturation at the top but with more variance in mid-tier models

The most informative math benchmark for comparing frontier models is arguably MATH-500, where the broader difficulty range means scores aren't all clustered at 95+.

How to use math benchmarks when choosing an LLM

If your use case involves mathematical reasoning — whether for education, research, or engineering — here's what the data tells you:

  1. Any frontier model works for standard math: If you need algebra, calculus, or statistics, every model scoring 90+ on AIME will handle it well.
  2. Check MATH-500 for harder problems: For graduate-level mathematics or novel proofs, MATH-500 scores show more differentiation.
  3. Don't pick a model based on AIME alone: A 97 vs 98 on AIME tells you nothing meaningful. Look at coding benchmarks or reasoning benchmarks where gaps are wider.
  4. Reasoning models have an edge: Models with explicit chain-of-thought capabilities (o3, o4-mini, DeepSeek R1) tend to be more reliable on multi-step math than their non-reasoning counterparts.

The bottom line

Competition math benchmarks were immensely valuable during the 2023-2024 period when they could differentiate between models. In 2026, they've served their purpose — the ceiling has been reached. The frontier of AI mathematical evaluation has moved to harder problems, broader coverage, and more nuanced reasoning tasks.

See all math scores on our math rankings page, or compare specific models on our comparison pages.


Data from BenchLM.ai. Last updated March 2026.
