Math Benchmarks
Mathematical reasoning and problem solving
AIME 2023 · AIME 2024 · AIME 2025 · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500
Math benchmarks test whether AI models can solve competition-level mathematics problems requiring creative insight and multi-step reasoning. The benchmarks tracked here are scored on final answers rather than written proofs. Mathematics carries a 5% weight in BenchLM.ai's overall scoring system.
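The 5% figure implies the overall score is a weighted average across benchmark categories. A minimal sketch of that aggregation, with the caveat that only the 5% math weight comes from this page; the other category names and weights below are hypothetical placeholders:

```python
def overall_score(category_scores: dict, weights: dict) -> float:
    """Weighted average of per-category scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[c] * category_scores[c] for c in weights)

# Math carries 5%; the remaining 95% split below is invented for illustration.
weights = {"math": 0.05, "coding": 0.45, "knowledge": 0.30, "agentic": 0.20}
scores = {"math": 91, "coding": 85, "knowledge": 88, "agentic": 80}
print(round(overall_score(scores, weights), 2))  # → 85.2
```

One consequence of the low weight: even a 10-point math gap between two models moves their overall scores by only half a point.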
A key challenge with math benchmarks in 2026 is saturation. Frontier models score 95-99% on AIME and HMMT — competition math is effectively solved by AI. The 1-2 point differences between top models are within noise range. BRUMO and MATH-500 still show more meaningful separation, particularly among mid-tier and open-weight models.
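The noise claim can be made concrete with a back-of-the-envelope standard error. A sketch assuming a single scoring pass over the 30 problems AIME poses per year (AIME I and II combined), treating each problem as an independent Bernoulli trial:

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n items: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# At 97% accuracy on a 30-problem set, a single run's score carries
# roughly 3 percentage points of sampling noise, so a 1-2 point gap
# between two models is not statistically meaningful.
print(round(100 * accuracy_se(0.97, 30), 1))  # → 3.1
```

Averaging k independent runs shrinks this standard error by a factor of sqrt(k), which is why labs typically report pass@1 averaged over many samples.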
Reasoning-enhanced models (those using chain-of-thought) consistently outperform standard models on math by 10-20 points. If mathematical reasoning is critical for your use case, prioritize models with explicit reasoning capabilities. See our math rankings for the full leaderboard, or read our AIME & HMMT explainer.
| # | Model | Org | Access | Type | Context | Score | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | MATH-500 |
|---|-------|-----|--------|------|---------|-------|-----------|-----------|-----------|---------------|---------------|---------------|------------|----------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 99% | 99% | 99% | 96% | 98% | 97% | 97% | 99% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 99% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 98% | 98% | 97% | 94% | 96% | 95% | 95% | 98% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 98% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 93% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 97% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 92% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 94% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 95% | 97% | 96% | 91% | 93% | 92% | 94% | 94% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 98% | 99% | 98% | 94% | 96% | 95% | 96% | 92% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 93% | 95% | 94% | 89% | 91% | 90% | 92% | 92% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 89% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 99% | 99% | 98% | 95% | 97% | 96% | 96% | 91% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 94% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 97% | 99% | 98% | 93% | 95% | 94% | 96% | 88% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 96% | 98% | 97% | 92% | 94% | 93% | 95% | 89% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 94% | 96% | 95% | 90% | 92% | 91% | 93% | 92% |
About Math Benchmarks
High school mathematics competition