Math Benchmarks — AIME, HMMT & MATH-500 Leaderboard
Mathematical reasoning and problem solving
Bottom line: Competition math is largely solved by frontier models; the older AIME and HMMT editions are saturated. AIME 2025, BRUMO 2025, and MATH-500 still show meaningful separation.
AIME 2023 · AIME 2024 · AIME 2025 · AIME25 (Arcee) · HMMT Feb 2023 · HMMT Feb 2024 · HMMT Feb 2025 · BRUMO 2025 · MATH-500
Best Math picks
BenchLM summaries for math, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
Top AI Models for Math — April 2026
As of April 2026, GPT-5.3 Codex leads the provisional math leaderboard with a weighted score of 100.0%, followed by GPT-5.2-Codex (97.7%) and GPT-5.1-Codex-Max (97.2%). BenchLM is currently showing 86 provisional-ranked models and 0 verified-ranked models in this category.
1. GPT-5.3 Codex (OpenAI)
2. GPT-5.2-Codex (OpenAI)
3. GPT-5.1-Codex-Max (OpenAI)
What changed
Claude Mythos Preview leads math with top BRUMO and MATH-500 scores.
GPT-5.4 is a close second, with near-perfect AIME scores.
Gemini 3.1 Pro is a strong third and the best value option for math-heavy workloads.
Math Leaderboard
Updated April 21, 2026. Sorted by math weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Org | Math score | Overall | AIME 2023 | AIME 2024 | AIME 2025 | AIME25 (Arcee) | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | 100% | Est. 89 | — | — | — | — | — | — | — | — | — |
| 2 | GPT-5.2-Codex | OpenAI | 97.7% | Est. 79 | — | — | — | — | — | — | — | — | — |
| 3 | GPT-5.1-Codex-Max | OpenAI | 97.2% | Est. 78 | — | — | — | — | — | — | — | — | — |
| 4 | Gemini 3 Pro Deep Think | Google | 96% | Est. 86 | — | — | — | — | — | — | — | — | — |
| 5 | Claude Opus 4.5 | Anthropic | 94.9% | 80 | — | — | — | — | — | — | — | — | — |
| 6 | | | 93.7% | Est. 72 | — | — | — | — | — | — | — | — | — |
| 7 | | | 93% | Est. 84 | — | — | — | — | — | — | — | — | — |
| 8 | GPT-5.4 | OpenAI | 92.8% | 93 | — | — | — | — | — | — | — | — | — |
| 9 | Qwen3.5 397B (Reasoning) | Alibaba | 92.3% | Est. 80 | — | — | — | — | — | — | — | — | — |
| 10 | Grok 4.1 | xAI | 91.9% | Est. 80 | — | — | — | — | — | — | — | — | — |
| 11 | GPT-5 (medium) | OpenAI | 91.7% | Est. 73 | — | — | — | — | — | — | — | — | — |
| 12 | GLM-5.1 | Z.AI | 90.4% | 84 | — | — | — | — | — | — | — | — | — |
| 13 | Sarvam 105B | Sarvam | 90.4% | Est. 41 | — | — | — | — | — | — | — | — | — |
| 14 | Claude Opus 4.6 | Anthropic | 89.4% | 91 | — | — | — | 99.8% | — | — | — | — | — |
| 15 | GLM-5 | Z.AI | 87.7% | 77 | — | — | — | 93.3% | — | — | — | — | — |
| 16 | Claude Sonnet 4.5 | Anthropic | 87.7% | Est. 67 | — | — | 87% | — | — | — | — | — | — |
| 17 | o3-pro | OpenAI | 86.4% | Est. 59 | — | — | — | — | — | — | — | — | — |
| 18 | GPT-5.2 | OpenAI | 83.7% | 83 | — | — | — | — | — | — | — | — | — |
| 19 | o3 | OpenAI | 83.4% | Est. 59 | — | — | — | — | — | — | — | — | — |
| 20 | Gemini 3 Pro | Google | 83% | Est. 83 | — | — | — | — | — | — | — | — | — |
| 21 | o1-preview | OpenAI | 82.7% | Est. 68 | — | — | — | — | — | — | — | — | — |
| 22 | MiMo-V2-Flash | Xiaomi | 82.1% | Est. 62 | — | — | 94.1% | — | — | — | — | — | — |
| 23 | Sarvam 30B | Sarvam | 81.2% | Est. 42 | — | — | — | — | — | — | — | — | — |
| 24 | Grok 4 | xAI | 80% | Est. 67 | — | — | — | — | — | — | — | — | — |
| 25 | GLM-4.7 | Z.AI | 79.8% | Est. 71 | — | — | 95.7% | — | — | — | — | — | — |
Score in Context
What these scores mean
Math carries a 5% weight in overall scoring, kept relatively low because frontier models have saturated the main competition benchmarks: older AIME and HMMT scores sit at 95-99% across top models. The weighted score rests on AIME 2025, BRUMO 2025, and MATH-500, which still show meaningful separation.
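As a rough illustration, assuming the overall score is a simple weighted average of category scores (the exact formula lives on the BenchLM methodology page), the arithmetic looks like this; the category score below is invented:

```python
# Illustration only: an invented score, and a simple weighted-average
# aggregation that may differ from BenchLM's exact formula.
MATH_CATEGORY_WEIGHT = 0.05  # math's 5% share of the overall score

math_category_score = 92.8   # hypothetical math weighted score (0-100)
contribution = MATH_CATEGORY_WEIGHT * math_category_score

print(f"Math contributes {contribution:.2f} points to the overall score")
# Math contributes 4.64 points to the overall score
```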
Known limitations
Older AIME and HMMT editions are effectively solved by AI; they are displayed for reference but no longer factor into the weighted score. If math reasoning is critical for your use case, look at BRUMO 2025 scores specifically, and consider models with explicit reasoning capabilities (chain-of-thought). See the AIME & HMMT explainer.
How we weight
Mathematics carries a 5% weight in BenchLM.ai's overall scoring. Within the category, AIME 2025, BRUMO 2025, and MATH-500 are weighted; the older AIME and HMMT editions, where frontier models score 95-99%, are displayed for reference only due to saturation. If mathematical reasoning is critical, prioritize models with explicit reasoning capabilities. See the math leaderboard or read the AIME & HMMT explainer.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
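A minimal sketch of that filtering rule, assuming each benchmark row carries a provenance tag. The tag names, the row shape, and the example data are illustrative assumptions, not BenchLM's actual schema:

```python
# Hypothetical provenance tags: "derived" = generated from other scores,
# "cloned" = copied from a reference model. Not BenchLM's real schema.
SYNTHETIC = {"derived", "cloned"}

def trustworthy_rows(rows: list[dict]) -> list[dict]:
    """Keep only sourced public benchmark rows; excluded weighted
    benchmarks stay missing instead of being backfilled."""
    return [r for r in rows if r.get("provenance") not in SYNTHETIC]

rows = [
    {"benchmark": "AIME 2025", "score": 94.1, "provenance": "public"},
    {"benchmark": "BRUMO 2025", "score": 90.0, "provenance": "derived"},
]
print(trustworthy_rows(rows))  # only the AIME 2025 row survives
```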
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| AIME 2023 | — | Display only | High school mathematics competition |
| AIME 2024 | — | Display only | High school mathematics competition |
| AIME 2025 | 25% | Weighted | High school mathematics competition |
| AIME25 (Arcee) | — | Display only | Display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart. |
| HMMT Feb 2023 | — | Display only | Collegiate mathematics competition |
| HMMT Feb 2024 | — | Display only | Collegiate mathematics competition |
| HMMT Feb 2025 | — | Display only | Collegiate mathematics competition |
| BRUMO 2025 | 25% | Weighted | University-level mathematics olympiad |
| MATH-500 | 15% | Weighted | Curated 500-problem subset of the MATH dataset covering algebra, geometry, number theory, and more |
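Continuing the sketch above with the weights from this table, one way the fallback could work is to renormalize the remaining weights when a weighted benchmark has no trustworthy row, rather than imputing the gap; the scores here are invented:

```python
# Weights from the table above; example scores are invented, and BRUMO
# is deliberately missing to show the renormalization fallback.
MATH_WEIGHTS = {"AIME 2025": 0.25, "BRUMO 2025": 0.25, "MATH-500": 0.15}

scores = {"AIME 2025": 94.1, "MATH-500": 88.0}  # no trusted BRUMO row

present = {b: w for b, w in MATH_WEIGHTS.items() if b in scores}
category = sum(scores[b] * w for b, w in present.items()) / sum(present.values())

print(f"Math weighted score: {category:.1f}")  # Math weighted score: 91.8
```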