
Best LLM for Math 2026: AIME, HMMT & MATH-500 Rankings

Best LLM for math 2026: GPT-5.4 leads AIME 2025, MATH-500, and BRUMO. Compare Claude, Gemini, DeepSeek-R1, GPT-5.5, and value picks by use case.

Glevd · Published May 12, 2026 · 10 min read


GPT-5.4 is the best LLM for math 2026, scoring 99 on AIME 2025, 99 on MATH-500, and 97 on BRUMO 2025. GPT-5.2 Pro matches it on all three rows, GPT-5.3 Codex follows at 98/99/96 across AIME 2025, MATH-500, and BRUMO, and Claude Opus 4.6 is close behind at 98/98/96.

BenchLM tracks AIME 2023/2024/2025, HMMT 2023/2024/2025, BRUMO 2025, and MATH-500 because one year is not enough. A model that aces old competition problems but drops on the 2025 set is a contamination risk, not a math leader. For the live sortable view, use the math leaderboard or the best math LLM hub.

Best LLM for math 2026, ranked

| Model | Creator | AIME 2025 | HMMT 2025 | MATH-500 | BRUMO 2025 |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | 99 | 97 | 99 | 97 |
| GPT-5.2 Pro | OpenAI | 99 | 97 | 99 | 97 |
| GPT-5.3 Codex | OpenAI | 98 | 96 | 99 | 96 |
| GPT-5.1-Codex-Max | OpenAI | 98 | 96 | 93 | 96 |
| GPT-5.2-Codex | OpenAI | 98 | 96 | 94 | 96 |
| Grok 4.1 | xAI | 98 | 96 | 97 | 96 |
| Gemini 3 Pro Deep Think | Google | 98 | 96 | 92 | 96 |
| Claude Opus 4.6 | Anthropic | 98 | 96 | 98 | 96 |
| GLM-5 (Reasoning) | Z.AI | 98 | 95 | 92 | 96 |
| Claude Sonnet 4.6 | Anthropic | 98 | 96 | 97.8 | 96 |
| Gemini 3 Pro | Google | 98 | 96 | 91 | 96 |
| Kimi K2.5 (Reasoning) | Moonshot AI | 96.1 | 95.4 | 92 | 93 |
| Grok 4.1 Fast | xAI | 97 | 93 | 89 | 95 |
| DeepSeek-R1 | DeepSeek | 45 | 41 | 97.3 | 43 |
| Sarvam 105B | Sarvam | 88.3 | — | 98.6 | — |

Scores from the BenchLM.ai math leaderboard. Last updated May 2026.

The top of the table is bunched: the strongest proprietary models sit between 96 and 99 on the current competition-style rows. DeepSeek-R1 is the outlier because MATH-500 saturated earlier than AIME, HMMT, and BRUMO; its 97.3 MATH-500 score aged well, but its 2025 competition-math rows did not. Sarvam 105B is worth calling out for MATH-500 specifically, where it scores 98.6 despite thinner math coverage elsewhere.

Why GPT-5.4 leads on math

GPT-5.4 leads because it has the cleanest full-coverage row across the benchmarks most readers can compare directly: 99 on AIME 2025, 97 on HMMT 2025, 99 on MATH-500, and 97 on BRUMO 2025. That combination makes GPT-5.4 the safest default when the task is hard algebra, number theory, probability, geometry, or competition-style reasoning.

The asterisk is real. GPT-5.2 Pro ties every cell in the table, and GPT-5.3 Codex ties MATH-500 while losing by one point on AIME, HMMT, and BRUMO. At this saturation level, those gaps rarely show up in ordinary math workflows. Cost, latency, tooling, and explanation style matter more than a 1-point benchmark edge.

For production math work, GPT-5.3 Codex is the value row. The LLM pricing guide lists it below GPT-5.4 on output cost, while its math scores remain inside the top cluster. GPT-5.4 is the better default when you want one model across math, knowledge, coding, and agentic work; GPT-5.3 Codex is the practical choice when math and symbolic manipulation drive the workload.

The honest caveat: frontier math benchmarks are saturated. A 1- or 2-point gap among the top 10 models is noise until your own evaluation says otherwise. For competition-style problems, several models will work. The use-case sections below matter more than the headline rank, and the AIME/HMMT explainer gives the longer version of why.
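To see how wide that noise band is, here is a minimal sketch that treats a benchmark score as accuracy over a 30-problem AIME-style set (both 2025 exams combined) and computes a binomial standard error. The problem count, the single-run assumption, and the model names are our placeholders, not BenchLM's methodology.

```python
import math

def score_stderr(score_pct: float, n_problems: int) -> float:
    """Standard error, in percentage points, of a benchmark accuracy,
    treating each problem as an independent pass/fail trial."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)

# Assumed setup: one run over a 30-problem AIME-style set.
for model, score in [("Model A", 99.0), ("Model B", 98.0)]:
    half_width = 1.96 * score_stderr(score, n_problems=30)
    print(f"{model}: {score:.0f} +/- {half_width:.1f} points (95% interval)")
```

On those assumptions the two intervals overlap almost entirely, which is the sense in which a 1-point gap between top models is noise.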

GPT-5.5, DeepSeek V4, and next-gen math benchmarks

OpenAI's GPT-5.5 and GPT-5.5 Pro, released April 23, 2026, do not report AIME 2025, HMMT 2025, MATH-500, or BRUMO 2025 in the BenchLM dataset. DeepSeek's V4 Pro series, released April 24, 2026, also skips those older rows. That is the headline shift in AI math benchmarks: labs have moved from saturated tests to FrontierMath, HMMT Feb 2026, IMOAnswerBench, and APEX.

What we know is narrower but more meaningful. GPT-5.5 Pro scores 52.4 on FrontierMath, and GPT-5.5 scores 51.7. DeepSeek V4 Pro Max scores 95.2 on HMMT Feb 2026, 89.8 on IMOAnswerBench, 38.3 on APEX, and 90.2 on APEX Shortlist. DeepSeek V4 Flash Max is close behind at 94.8 on HMMT Feb 2026, 88.4 on IMOAnswerBench, and 33 on APEX.

Do not compare GPT-5.5 Pro's 52.4 FrontierMath score to GPT-5.4's 99 on MATH-500. They are different tests with different difficulty floors. If you need a model today for practical competition or applied math, GPT-5.4 and Claude Opus 4.6 are safe picks. If you need the capability ceiling on research-grade math, GPT-5.5 Pro's FrontierMath row is the one to watch until full benchmark coverage catches up.

What each AI math benchmark tests

AIME

AIME, the American Invitational Mathematics Examination, is a 15-problem, 3-hour US high school competition with integer answers. It tests creative insight more than advanced notation, which is why it became a useful AIME LLM benchmark in the first place. BenchLM tracks AIME 2023, AIME 2024, and AIME 2025 to catch year-over-year drops that suggest memorization rather than general math ability.
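Because every AIME answer is an integer from 0 to 999, grading can be fully automatic. The sketch below shows one simplified way to do it; it is an illustrative harness, not BenchLM's, and it assumes the completion states a plain final integer.

```python
import re

def extract_aime_answer(completion: str) -> int | None:
    """Pull the last 0-999 integer from a completion. A simplified stand-in
    for a real grading harness, which usually requires a clearly delimited
    final answer rather than scanning free text."""
    matches = re.findall(r"\b\d{1,3}\b", completion)
    return int(matches[-1]) if matches else None

def grade(completion: str, gold: int) -> bool:
    return extract_aime_answer(completion) == gold

print(grade("The sum telescopes, so the final answer is 204.", 204))  # True
```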

HMMT

HMMT, the Harvard-MIT Mathematics Tournament, is comparable to AIME but more proof-flavored and multi-step. Frontier models now score in the mid-to-high 90s on the latest HMMT row, so it is still useful as a floor check but weak as a frontier separator. BenchLM tracks HMMT 2023, HMMT 2024, and HMMT 2025.

MATH-500

MATH-500 samples 500 problems from the Hendrycks MATH dataset across algebra, geometry, number theory, probability, and precalculus-style reasoning. Its broader difficulty range makes it more useful for mid-tier separation than AIME alone. DeepSeek-R1's 97.3 row is the clearest example: the model remains strong on MATH-500 even though its AIME 2025 and HMMT 2025 scores trail current frontier models badly.

BRUMO 2025

BRUMO 2025, the Brown University Math Olympiad row, is one of the newer discriminators in BenchLM's math set. The top models reach 96-97, but mid-tier reasoners drop faster than they do on AIME. Its 2025 date also lowers training-data contamination risk compared with older public problem sets.

FrontierMath, HMMT Feb 2026, IMOAnswerBench, and APEX

The next-gen suite exists because the older benchmarks are crowded at the top. FrontierMath uses research-grade problems written by mathematicians. HMMT Feb 2026 refreshes the competition format with newer problems. IMOAnswerBench uses Olympiad-style answer checking, and APEX exposes much steeper drop-off even for strong reasoners. For the general benchmark methodology, read what benchmarks measure.

Best LLM for math by use case

Competition math: AIME, HMMT, USAMO-style

Best: GPT-5.4 or GPT-5.2 Pro. Both sit at 99 on AIME 2025 and 97 on HMMT 2025. Claude Opus 4.6 and Grok 4.1 are effectively tied for practical use, and GLM-5 (Reasoning) is the strongest open-weight option in the current table. Require step-by-step reasoning for this category; without it, even frontier models can drop 10-20 points.
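As a concrete version of that requirement, here is a hypothetical prompt template that forces step-by-step work and ends with one machine-checkable line. The wording is our illustration, not a tested or recommended BenchLM prompt.

```python
# Hypothetical prompt template; the wording is an illustration, not a tested prompt.
MATH_PROMPT = """You are solving a competition math problem.
Work step by step: state what is given, derive intermediate results,
and check each one before moving on.
End with exactly one line of the form: FINAL ANSWER: <integer>

Problem:
{problem}
"""

def build_prompt(problem: str) -> str:
    return MATH_PROMPT.format(problem=problem)

print(build_prompt("How many ordered pairs (a, b) of positive integers satisfy ..."))
```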

Engineering and applied math

Best: GPT-5.3 Codex. It combines a 98 AIME 2025 row with 99 on MATH-500 and strong coding-adjacent symbolic work, which matters for calculus, linear algebra, differential equations, and numerical workflows. Claude Opus 4.6 is the second pick when readable derivations matter more than raw throughput. For scientific coding and applied math, also check SciCode.

Finance and quantitative reasoning

Best: GPT-5.4 or Claude Opus 4.6. Both handle probability, expected value, estimation, and model-checking reliably when prompted to show assumptions. For finance, numerical hallucination rate matters more than peak AIME score. Prefer models that expose uncertainty and let you audit intermediate steps over ones that push aggressive one-shot answers.

Math tutoring and education

Best: Claude Opus 4.6 or Claude Sonnet 4.6. Tutoring is not only a math problem; it is a pedagogy and explanation problem layered on top. Anthropic's models tend to produce clearer step-by-step explanations, while GPT-5.4 is better when the student needs multiple solution paths or a terse verification pass. None of these models are 100% reliable, so verify worked solutions.

Math research and frontier reasoning

Best: GPT-5.5 Pro is the current ceiling on the hardest reported math row, with 52.4 on FrontierMath. GPT-5.5 follows at 51.7. That still means roughly half of FrontierMath problems are unsolved, so the benchmark answer to "can an LLM be a proof co-author?" is closer than before, not solved. For related graduate-level coverage, compare HLE and the HLE explainer.

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3 Pro vs DeepSeek-R1 on math

| Model | AIME 2025 | HMMT 2025 | MATH-500 | BRUMO 2025 | Source | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 99 | 97 | 99 | 97 | Proprietary | Frontier leader with full coverage |
| Claude Opus 4.6 | 98 | 96 | 98 | 96 | Proprietary | Best explanation quality among the top rows |
| Gemini 3 Pro | 98 | 96 | 91 | 96 | Proprietary | Competitive on AIME/HMMT, weaker on MATH-500 |
| DeepSeek-R1 | 45 | 41 | 97.3 | 43 | Open weight | 2024-era reasoner; MATH-500 held up |

GPT-5.4

GPT-5.4 is the default pick when peak math capability matters and you want the strongest broad BenchLM row. It is not meaningfully ahead of GPT-5.2 Pro on these math rows, but it is the more current full-coverage flagship.

Claude Opus 4.6

Claude Opus 4.6 trails GPT-5.4 by one point on every row: AIME, HMMT, MATH-500, and BRUMO. That is not the buying reason. The buying reason is explanation quality: Opus is the strongest option here for tutoring, derivations, and readable multi-step work.

Gemini 3 Pro

Gemini 3 Pro is competitive on competition-style rows: 98 on AIME 2025, 96 on HMMT 2025, and 96 on BRUMO. The weak spot is the MATH-500 ranking, where it scores 91 versus 98-99 for OpenAI and Anthropic's best rows. If your workload is broader-difficulty math rather than olympiad-style problems, GPT-5.4 or Claude Opus 4.6 is safer.

DeepSeek-R1

The honest DeepSeek-R1 math take is split. DeepSeek-R1 scores 97.3 on MATH-500, which remains strong. But it scores 45 on AIME 2025, 41 on HMMT 2025, and 43 on BRUMO 2025, so it is well below current frontier competition-math models. If you need an open-weight math model in 2026, start with GLM-5 (Reasoning) or Kimi K2.5 (Reasoning), then compare against the broader best open-source LLM guide.

Math model pricing in 2026

| Model | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- |
| GPT-5.4 | 2.50 | 15.00 | Best full-coverage math row |
| GPT-5.3 Codex | 2.50 | 10.00 | Best value near the frontier |
| Claude Opus 4.6 | 5.00 | 25.00 | Premium explanation quality |
| Claude Sonnet 4.6 | 3.00 | 15.00 | Tutoring/value Anthropic option |
| Gemini 3 Pro | — | — | No confirmed row in the pricing guide |
| GLM-5 (Reasoning) | — | — | Open-weight deployment varies |

Pricing from the LLM pricing 2026 guide where confirmed. Prices are per million tokens.

Pricing changes the recommendation more than the top-line math score does. If you are solving a few hard problems, use GPT-5.4 or Claude Opus 4.6. If you are building a high-volume math assistant, GPT-5.3 Codex has enough benchmark headroom to justify starting there before paying for a flagship row.
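To make that trade-off concrete, here is a back-of-the-envelope sketch using the table prices above. The workload numbers (requests per month and tokens per request) are placeholder assumptions, not measurements.

```python
def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Estimated monthly spend in dollars; prices are per million tokens."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1e6

# Assumed workload: 50,000 requests/month, ~1,000 input and ~2,000 output tokens each.
workload = dict(requests=50_000, in_tokens=1_000, out_tokens=2_000)
for name, in_price, out_price in [("GPT-5.4", 2.50, 15.00), ("GPT-5.3 Codex", 2.50, 10.00)]:
    cost = monthly_cost(**workload, in_price=in_price, out_price=out_price)
    print(f"{name}: about ${cost:,.0f}/month")
```

On that assumed workload the gap is roughly $500 per month, which is usually a bigger lever than a 1-point AIME difference.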

How BenchLM ranks math models

BenchLM's math score weights the benchmarks that still provide signal: FrontierMath, AIME 2025, BRUMO 2025, and MATH-500. HMMT remains visible because readers recognize it and it is useful as a floor check, but the top end is too compressed for HMMT alone to decide the best AI for math.
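As an illustration only, a weighted composite over reported rows might look like the sketch below. The weights are placeholders we made up for the example; BenchLM's actual values live on its methodology page.

```python
# Placeholder weights for illustration; BenchLM's actual weighting is on its methodology page.
WEIGHTS = {"FrontierMath": 0.40, "AIME 2025": 0.25, "BRUMO 2025": 0.20, "MATH-500": 0.15}

def composite_math_score(scores: dict[str, float]) -> float:
    """Weighted average over whichever weighted benchmarks a model reports."""
    reported = {name: w for name, w in WEIGHTS.items() if name in scores}
    return sum(scores[name] * w for name, w in reported.items()) / sum(reported.values())

# A model with no FrontierMath row is scored over the three benchmarks it does report.
print(round(composite_math_score({"AIME 2025": 99, "BRUMO 2025": 97, "MATH-500": 99}), 1))  # 98.3
```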

Tracking three AIME years matters because it helps detect contamination. A model that scores 99 on AIME 2023 but collapses on AIME 2025 probably memorized older public problems. Frontier models score consistently across the three AIME rows, which is evidence of generalizable math reasoning rather than benchmark recall.
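A crude version of that contamination check can be written as a year-over-year drop heuristic, sketched below. The 15-point threshold is an arbitrary assumption for illustration, not BenchLM's rule.

```python
def looks_contaminated(aime_by_year: dict[int, float], max_drop: float = 15.0) -> bool:
    """Flag a model whose newest-year AIME score falls well below its best
    older-year score. The 15-point threshold is an arbitrary illustration."""
    newest = max(aime_by_year)
    older_best = max(score for year, score in aime_by_year.items() if year != newest)
    return older_best - aime_by_year[newest] > max_drop

print(looks_contaminated({2023: 99, 2024: 98, 2025: 70}))  # True: likely memorized older sets
print(looks_contaminated({2023: 97, 2024: 98, 2025: 99}))  # False: consistent across years
```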

The full scoring model lives on the BenchLM methodology page, and the practical recommendation surface is the best math LLM hub.

Frequently asked questions

What is the best LLM for math in 2026? GPT-5.4 leads the benchmarks readers can compare on directly: 99 on AIME 2025, 99 on MATH-500, and 97 on BRUMO 2025. GPT-5.2 Pro ties on every cell, and GPT-5.3 Codex is the value pick at 98/99/96. OpenAI's newer GPT-5.5 Pro does not report scores on these benchmarks and instead leads FrontierMath at 52.4, a different, harder test. For practical competition-style or applied math, GPT-5.4 or Claude Opus 4.6 is the safe choice today.

Which AI model is best at AIME 2025? GPT-5.4 and GPT-5.2 Pro both score 99 on AIME 2025. Nine other models in the table score 98. The benchmark is saturated, so pick on cost, latency, explanation quality, or platform fit once a model is in the top cluster.

Which AI model is best at MATH-500? GPT-5.4, GPT-5.2 Pro, and GPT-5.3 Codex all score 99 on MATH-500. Claude Opus 4.6 follows at 98, with Claude Sonnet 4.6 at 97.8. Among open-weight models, Sarvam 105B at 98.6 and DeepSeek-R1 at 97.3 are the standout MATH-500 rows.

Is DeepSeek-R1 good at math? DeepSeek-R1 is still good on MATH-500, where it scores 97.3, but it is no longer competitive on current competition-math rows. It scores 45 on AIME 2025, 41 on HMMT 2025, and 43 on BRUMO 2025. For open-weight math in 2026, GLM-5 (Reasoning) and Kimi K2.5 (Reasoning) are stronger picks.

What math model gives the best value in 2026? GPT-5.3 Codex is the best value pick near the frontier. It matches GPT-5.4 on MATH-500 at 99 and is one point behind on AIME 2025, HMMT 2025, and BRUMO 2025. Claude Sonnet 4.6 is the next practical step down if you prefer Anthropic, while open-weight self-hosters should look at GLM-5 (Reasoning).

The bottom line

GPT-5.4 is the best fully covered LLM for math in 2026, but the top cluster is saturated. GPT-5.2 Pro ties it, GPT-5.3 Codex is the value pick, and Claude Opus 4.6 is the best choice when explanation quality matters.

Use the live math leaderboard for sortable scores, the best math hub for recommendations, the AIME and HMMT explainer for benchmark context, and the best open-source LLM guide if you need self-hosted math models.

Data from BenchLM.ai. Last updated May 2026.
