FrontierMath

Name: FrontierMath
Creator: BenchLM

An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.

Top models on FrontierMath — May 22, 2026

As of May 22, 2026, GPT-5.5 Pro leads the FrontierMath leaderboard with 52.4% , followed by GPT-5.5 (51.7%) and GPT-5.4 Pro (50%).

1Closed

GPT-5.5 Pro

OpenAI

52.4%

Overall —Context 1M

2Closed

GPT-5.5

OpenAI

51.7%

Overall 91Context 1M

3Closed

GPT-5.4 Pro

OpenAI

50%

Overall 91Context 1.05M

4 modelsMath35% of category scoreRefreshingUpdated May 22, 2026

According to BenchLM.ai, GPT-5.5 Pro leads the FrontierMath benchmark with a score of 52.4%, followed by GPT-5.5 (51.7%) and GPT-5.4 Pro (50%). The top models are clustered within 2.4 points, suggesting this benchmark is nearing saturation for frontier models.

4 models have been evaluated on FrontierMath. The benchmark falls in the Math category. This category carries a 5% weight in BenchLM.ai's overall scoring system. Within that category, FrontierMath contributes 35% of the category score, so strong performance here directly affects a model's overall ranking.

About FrontierMath

Year

2024

Tasks

350 original research-level math problems

Format

Open-ended mathematical reasoning with tool access

Difficulty

Research-level mathematics

FrontierMath is the hardest public math benchmark. It consists of 300 Tier 1-3 problems and 50 Tier 4 problems, all original and unpublished. Models are evaluated with access to Python and computational tools. Top models score under 50%, making it a critical discriminator for frontier mathematical reasoning.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

BenchLM freshness & provenance

Version

FrontierMath 2024

Refresh cadence

Annual

Staleness state

Refreshing

Question availability

Public benchmark set

Refreshing

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.