What is the best LLM for math?

The best LLMs for math are ranked by competition-level benchmarks like AIME and HMMT, with top models achieving strong scores on problems from real math competitions.

How are math benchmarks scored for LLMs?

Math benchmarks score models on correctness of final answers to problems ranging from algebra to advanced competition mathematics, testing both calculation and mathematical reasoning.
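A minimal sketch of final-answer scoring, assuming exact match after light normalization. The function names `normalize_answer` and `accuracy` are hypothetical, and real graders may apply looser equivalence checks (fractions, LaTeX, units) than shown here.

```python
def normalize_answer(raw: str) -> str:
    """Normalize a final answer so trivially different spellings compare equal."""
    text = raw.strip().rstrip(".").strip()
    try:
        # AIME answers are integers from 000 to 999, so "042" and "42" should match.
        return str(int(text))
    except ValueError:
        # Non-integer answers are compared as plain text in this sketch.
        return text


def accuracy(predictions: list, references: list) -> float:
    """Return the percentage of problems whose final answer matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally long, non-empty prediction and reference lists")
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

Only the final answer is compared; the reasoning that led to it is not graded in this scheme.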

What benchmarks test math ability in AI models?

Major math benchmarks include AIME (high school competition problems), HMMT (Harvard-MIT tournament problems), BRUMO (university-level olympiad problems), and MATH-500 (a curated 500-problem subset of the MATH dataset).

Math Benchmarks — AIME, HMMT & MATH-500 Leaderboard

Name: Math Benchmarks — LLM Leaderboard
Creator: BenchLM.ai

Mathematical reasoning and problem solving

Bottom line: Competition math is largely solved by frontier models — AIME and HMMT are saturated. BRUMO and MATH-500 still show meaningful separation.

AIME 2023 · AIME 2024 · AIME 2025 · AIME25 (Arcee) · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500

Best Math picks

BenchLM summaries for math plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Best Math: GPT-5.3 Codex (OpenAI), category score 100
Best Open Weight: DeepSeek V4 Pro (Max) (DeepSeek), by overall score
Cheapest: Qwen3.6-27B (Alibaba), $0.00 avg / 1M tokens
Fastest: Mercury 2 (Inception), 789 tokens / sec
Lowest Latency: LFM2-24B-A2B (LiquidAI), 0.42s TTFT
Largest Context: Nemotron 3 Ultra 500B (NVIDIA), 10M context window

Top AI Models for Math — May 2026

As of May 2026, GPT-5.3 Codex leads the provisional math leaderboard with a weighted score of 100.0%, followed by Grok 4.1 (99.4%) and GPT-5.2-Codex (97.7%). BenchLM is currently showing 86 provisional-ranked models and 0 verified-ranked models in this category.

1. GPT-5.3 Codex (OpenAI, proprietary): 100.0% weighted
2. Grok 4.1 (xAI, proprietary): 99.4% weighted
3. GPT-5.2-Codex (OpenAI, proprietary): 97.7% weighted

86 provisional-ranked · 0 verified-ranked · 9 benchmarks · Updated May 12, 2026

What changed

Claude Mythos Preview leads math with top BRUMO and MATH-500 scores.

GPT-5.4 close second, with near-perfect AIME scores.

Gemini 3.1 Pro strong third — best value option for math-heavy workloads.

How to choose

Research-level math reasoning?

Claude Mythos Preview — best BRUMO scores

Competition math (AIME-level)?

GPT-5.4 — near-perfect competition scores

Math on a budget?

Gemini 3.1 Pro — strong math at $1.25/$5

Need explicit step-by-step proofs?

GPT-5.4 Pro — reasoning model with full CoT

Top models by benchmark

High school mathematics competition (25% of category score)

1. Kimi K2.5 (Reasoning): 96.1

Chart values (model names not preserved): 96.1, ~95.7, ~94.1, ~87

Math Leaderboard

Updated May 12, 2026

Sorted by math weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

86 ranked models

Provisional-ranked mode includes source-unverified non-generated benchmark evidence. P = provisional benchmark row


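Rank, model, creator	Weights	Type	Context	Price / 1M tokens	Tokens / sec	TTFT	Math weighted score	Overall score	AIME 2023	AIME 2024	AIME 2025	AIME25 (Arcee)	HMMT Feb 2023	HMMT Feb 2024	HMMT Feb 2025	BRUMO 2025	MATH-500
1 GPT-5.3 Codex OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	79	88.26s	100%	Est.87	—	—	—	—	—	—	—	—	—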
2 Grok 4.1 xAI	Closed	Standard	1M	N/A	N/A	N/A	99.4%	Est.90	—	—	—	—	—	—	—	—	—
3 GPT-5.2-Codex OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	123	87.34s	97.7%	Est.77	—	—	—	—	—	—	—	—	—
4 GPT-5.1-Codex-Max OpenAI	Closed	Reasoning	400K	$1.25 / $10.00	N/A	N/A	97.2%	Est.76	—	—	—	—	—	—	—	—	—
5 Gemini 3 Pro Deep Think Google	Closed	Reasoning	2M	N/A	N/A	N/A	95.4%	Est.91	—	—	—	—	—	—	—	—	—
6 Claude Opus 4.5 Anthropic	Closed	Standard	200K	$5.00 / $25.00	46	1.01s	94.9%	77	—	—	—	—	—	—	—	—	—
7 GPT-5.4 OpenAI	Closed	Reasoning	1.05M	$2.50 / $15.00	74	151.79s	94.4%	89	—	—	—	—	—	—	—	—	—
8 o1-preview OpenAI	Closed	Reasoning	200K	$15.00 / $60.00	N/A	N/A	94.1%	Est.83	—	—	—	—	—	—	—	—	—
9 Grok 4.1 Fast xAI	Closed	Standard	1M	$0.20 / $0.50	138	0.54s	93.7%	Est.70	—	—	—	—	—	—	—	—	—
10 GLM-5 (Reasoning) Z.AI Self-host	Open	Reasoning	200K	$1.00 / $3.20	N/A	N/A	92.4%	Est.82	—	—	—	—	—	—	—	—	—
11 Qwen3.5 397B (Reasoning) Alibaba Self-host	Open	Reasoning	128K	$0.60 / $3.60	N/A	N/A	92.3%	Est.79	—	—	—	—	—	—	—	—	—
12 GPT-5 (medium) OpenAI	Closed	Reasoning	128K	N/A	83	36.28s	91.7%	Est.71	—	—	—	—	—	—	—	—	—
13 GLM-5 Z.AI Self-host	Open	Standard	200K	$1.00 / $3.20	74	1.64s	91.3%	67	—	—	—	93.3%	—	—	—	—	—
14 Sarvam 105B Sarvam Self-host	Open	Reasoning	128K	$0.00 / $0.00	N/A	N/A	90.4%	Est.39	—	—	—	—	—	—	—	—	—
15 GLM-5.1 Z.AI Self-host	Open	Reasoning	203K	$1.40 / $4.40	N/A	N/A	89.6%	83	—	—	—	—	—	—	—	—	—
16 Claude Sonnet 4.5 Anthropic	Closed	Standard	200K	$3.00 / $15.00	N/A	N/A	87.7%	Est.66	—	—	87%	—	—	—	—	—	—
17 o3-pro OpenAI	Closed	Reasoning	200K	$20.00 / $80.00	27	84.93s	86.4%	Est.58	—	—	—	—	—	—	—	—	—
18 Claude Opus 4.6 Anthropic	Closed	Standard	1M	$5.00 / $25.00	40	1.78s	86.3%	87	—	—	—	99.8%	—	—	—	—	—
19 o3 OpenAI	Closed	Reasoning	200K	$2.00 / $8.00	118	5.38s	83.4%	Est.58	—	—	—	—	—	—	—	—	—
20 MiMo-V2-Flash Xiaomi Self-host	Open	Reasoning	256K	$0.00 / $0.00	129	2.14s	82.1%	Est.60	—	—	94.1%	—	—	—	—	—	—
21 GPT-5.2 OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	73	130.34s	81.7%	81	—	—	—	—	—	—	—	—	—
22 Gemini 3 Pro Google	Closed	Standard	2M	$2.00 / $12.00	109	32.65s	81.2%	81	—	—	—	—	—	—	—	—	—
23 Sarvam 30B Sarvam Self-host	Open	Reasoning	64K	$0.00 / $0.00	N/A	N/A	81.2%	Est.41	—	—	—	—	—	—	—	—	—
24 Grok 4 xAI	Closed	Standard	128K	N/A	54	15.60s	80%	Est.65	—	—	—	—	—	—	—	—	—
25 GLM-4.7 Z.AI Self-host	Open	Reasoning	200K	$0.00 / $0.00	82	1.10s	78.7%	Est.69	—	—	95.7%	—	—	—	—	—	—
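For readers who want to re-sort the rows offline, the following is a rough sketch of parsing one tab-separated row. The column positions are an assumption inferred from the rows above (rank and model label, weights, type, context, price, tokens per second, latency, math weighted score, overall score, then nine benchmark cells), and `parse_row`, `parse_percent`, and `resort` are hypothetical helper names, not part of any BenchLM tooling.

```python
from typing import Optional

DASH = "\u2014"  # placeholder the table uses for a missing benchmark score


def parse_percent(cell: str) -> Optional[float]:
    """Convert a cell such as '91.3%' to 91.3; missing markers become None."""
    cell = cell.strip()
    if cell in {DASH, "N/A", ""}:
        return None
    return float(cell.rstrip("%"))


def parse_row(line: str) -> dict:
    """Split one leaderboard row into named fields (assumed column order)."""
    cells = line.rstrip("\n").split("\t")
    rank_text, _, label = cells[0].partition(" ")
    overall = cells[8].strip()
    is_estimate = overall.startswith("Est.")
    return {
        "rank": int(rank_text),
        "label": label,  # model and creator are fused in the source text
        "weights": cells[1],
        "type": cells[2],
        "context": cells[3],
        "math_weighted": parse_percent(cells[7]),
        "overall": float(overall[len("Est."):] if is_estimate else overall),
        "overall_is_estimate": is_estimate,
        "benchmarks": [parse_percent(c) for c in cells[9:]],
    }


def resort(rows: list, key: str) -> list:
    """Return rows sorted from highest to lowest on a numeric field."""
    return sorted(rows, key=lambda row: row[key], reverse=True)
```

Calling `resort(rows, "overall")` on parsed rows mimics clicking the overall-score column header; estimated overall scores stay flagged through `overall_is_estimate` so they are not mistaken for measured ones.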

Showing 25 of 86

These rankings update weekly


Score in Context

What these scores mean

Math carries a 5% weight in overall scoring, which is relatively low because frontier models have saturated the main competition benchmarks: top models score 95-99% on AIME and HMMT. Within the category, the weighted benchmarks are AIME 2025 (25%), BRUMO 2025 (25%), and MATH-500 (15%). BRUMO and MATH-500 still show meaningful separation.

Known limitations

AIME and HMMT are effectively solved by AI. AIME 2023, AIME 2024, and every HMMT edition are displayed for reference only and do not factor into the weighted score; AIME 2025 is the one edition that still carries weight. If math reasoning is critical for your use case, look at BRUMO scores specifically, and consider models with explicit reasoning capabilities (chain-of-thought). See the AIME & HMMT explainer.

How we weight

Mathematics carries a 5% weight in BenchLM.ai's overall scoring. Frontier models score 95-99% on AIME and HMMT, so competition math is effectively solved by AI.

Because of this saturation, most AIME and HMMT editions are displayed for reference only; the table below lists which benchmarks are weighted. BRUMO and MATH-500 show more meaningful separation. If mathematical reasoning is critical, prioritize models with explicit reasoning capabilities. See the math leaderboard or read the AIME & HMMT explainer.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
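The exclusion rule above can be sketched as a provenance filter. This is an illustration under stated assumptions, not BenchLM's implementation: the `origin` tags, the `BenchmarkRow` type, and both helper names are hypothetical, and the scores in the usage note are made up for demonstration.

```python
from dataclasses import dataclass

# Benchmarks the page lists as weighted for the math category.
WEIGHTED = ("AIME 2025", "BRUMO 2025", "MATH-500")

# Hypothetical provenance tags for rows that must not count toward rankings.
EXCLUDED_ORIGINS = {"generated", "cloned"}


@dataclass(frozen=True)
class BenchmarkRow:
    benchmark: str
    score: float
    origin: str  # assumed values: "sourced", "provisional", "generated", "cloned"


def trustworthy(rows: list) -> list:
    """Drop rows that were generated from other scores or cloned from a reference model."""
    return [row for row in rows if row.origin not in EXCLUDED_ORIGINS]


def missing_weighted(rows: list) -> list:
    """List weighted benchmarks that have no trustworthy row left after filtering."""
    present = {row.benchmark for row in trustworthy(rows)}
    return [name for name in WEIGHTED if name not in present]
```

For example, given one sourced AIME 2025 row, one generated BRUMO 2025 row, and one provisional MATH-500 row, `missing_weighted` returns `["BRUMO 2025"]`, signalling that the category score has to come from the two remaining rows rather than from a synthetic BRUMO value.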

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark	Weight	Status	Description
AIME 2023	—	Display only	High school mathematics competition
AIME 2024	—	Display only	High school mathematics competition
AIME 2025	25%	Weighted	High school mathematics competition
AIME25 (Arcee)	—	Display only	Display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.
HMMT Feb 2023	—	Display only	Collegiate mathematics competition
HMMT Feb 2024	—	Display only	Collegiate mathematics competition
HMMT Feb 2025	—	Display only	Collegiate mathematics competition
BRUMO 2025	25%	Weighted	University-level mathematics olympiad
MATH-500	15%	Weighted	Curated 500-problem subset of the MATH dataset covering algebra, geometry, number theory, and more
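One plausible way to turn the weights in this table into a single category score is a weighted average renormalized over whichever weighted benchmarks a model actually has. The listed weights sum to 65%, and the page does not state how they are normalized, so the renormalization step and the function name `weighted_math_score` are assumptions made for this sketch.

```python
from typing import Optional

# Weights copied from the table; display-only benchmarks carry no weight.
MATH_WEIGHTS = {"AIME 2025": 0.25, "BRUMO 2025": 0.25, "MATH-500": 0.15}


def weighted_math_score(scores: dict) -> Optional[float]:
    """Weighted average over the weighted benchmarks present in `scores`.

    Assumption: weights are rescaled to sum to 1 over the available benchmarks,
    so a missing benchmark is skipped rather than counted as zero.
    """
    available = {name: w for name, w in MATH_WEIGHTS.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None  # no weighted benchmark available for this model
    return sum(scores[name] * w for name, w in available.items()) / total_weight
```

With hypothetical scores of 90 on AIME 2025, 80 on BRUMO 2025, and 100 on MATH-500, the result is (22.5 + 20 + 15) / 0.65, roughly 88.5. A display-only entry such as AIME 2024 in `scores` is ignored.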


About Math Benchmarks

High school mathematics competition

