Math Benchmarks
Mathematical reasoning and problem solving: compare AI models across 7 mathematical benchmarks spanning the AIME, HMMT, and BRUMO competitions.
Math Benchmark Results
Showing 25 of 52 models.
| Rank | Model | Creator | License | Type | Context | Score | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 (high) | OpenAI | Proprietary | Reasoning | 128K | 72 | 95% | 97% | 96% | 91% | 93% | 92% | 94% |
| 2 | o1-preview | OpenAI | Proprietary | Reasoning | 200K | 71 | 94% | 96% | 95% | 90% | 92% | 91% | 93% |
| 3 | GPT-5 (medium) | OpenAI | Proprietary | Reasoning | 128K | 70 | 93% | 95% | 94% | 89% | 91% | 90% | 92% |
| 4 | Grok 4 | xAI | Proprietary | Non-Reasoning | 128K | 69 | 87% | 89% | 88% | 84% | 86% | 85% | 87% |
| 5 | GPT-5 mini | OpenAI | Proprietary | Reasoning | 128K | 68 | 90% | 92% | 91% | 86% | 88% | 87% | 89% |
| 6 | o3-pro | OpenAI | Proprietary | Reasoning | 200K | 68 | 90% | 92% | 91% | 86% | 88% | 87% | 89% |
| 7 | o3 | OpenAI | Proprietary | Reasoning | 200K | 67 | 88% | 90% | 89% | 84% | 86% | 85% | 87% |
| 8 | Qwen2.5-1M | Alibaba | Open Weight | Non-Reasoning | 1M | 66 | 85% | 87% | 86% | 81% | 83% | 82% | 84% |
| 9 | Qwen2.5-72B | Alibaba | Open Weight | Non-Reasoning | 128K | 65 | 84% | 86% | 85% | 80% | 82% | 81% | 83% |
| 10 | o4-mini (high) | OpenAI | Proprietary | Non-Reasoning | 200K | 65 | 83% | 85% | 84% | 79% | 81% | 80% | 82% |
| 11 | Gemini 2.5 Pro | Google | Proprietary | Non-Reasoning | 2M | 65 | 84% | 86% | 85% | 80% | 82% | 81% | 83% |
| 12 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | Non-Reasoning | 128K | 64 | 81% | 83% | 82% | 77% | 79% | 78% | 80% |
| 13 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | Non-Reasoning | 128K | 63 | 80% | 82% | 81% | 76% | 78% | 77% | 79% |
| 14 | Claude 4.1 Opus | Anthropic | Proprietary | Non-Reasoning | 200K | 61 | 76% | 78% | 77% | 72% | 74% | 73% | 75% |
| 15 | Claude 4 Sonnet | Anthropic | Proprietary | Non-Reasoning | 200K | 59 | 73% | 75% | 74% | 69% | 71% | 70% | 72% |
| 16 | Llama 3.1 405B | Meta | Open Weight | Non-Reasoning | 128K | 58 | 70% | 72% | 71% | 66% | 68% | 67% | 69% |
| 17 | Mistral Large 2 | Mistral | Proprietary | Non-Reasoning | 128K | 57 | 68% | 70% | 69% | 64% | 66% | 65% | 67% |
| 18 | GPT-4o | OpenAI | Proprietary | Non-Reasoning | 128K | 56 | 66% | 68% | 67% | 62% | 64% | 63% | 65% |
| 19 | Claude 3.5 Sonnet | Anthropic | Proprietary | Non-Reasoning | 200K | 55 | 65% | 67% | 66% | 61% | 63% | 62% | 64% |
| 20 | Gemini 1.5 Pro | Google | Proprietary | Non-Reasoning | 2M | 54 | 64% | 66% | 65% | 60% | 62% | 61% | 63% |
| 21 | Mistral 8x7B | Mistral | Open Weight | Non-Reasoning | 32K | 52 | 65% | 67% | 66% | 61% | 63% | 62% | 64% |
| 22 | Gemini 1.0 Pro | Google | Proprietary | Non-Reasoning | 32K | 52 | 62% | 64% | 63% | 58% | 60% | 59% | 61% |
| 23 | Claude 3 Opus | Anthropic | Proprietary | Non-Reasoning | 200K | 51 | 61% | 63% | 62% | 57% | 59% | 58% | 60% |
| 24 | GPT-4 Turbo | OpenAI | Proprietary | Non-Reasoning | 128K | 50 | 60% | 62% | 61% | 56% | 58% | 57% | 59% |
| 25 | Llama 3 70B | Meta | Open Weight | Non-Reasoning | 128K | 48 | 58% | 60% | 59% | 54% | 56% | 55% | 57% |
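The table can be re-sorted by any benchmark column offline. A minimal Python sketch, using a handful of rows transcribed from the table above (the tuple layout is an assumption for illustration, not an export format the site provides):

```python
# A few (model, overall score, AIME 2025 %) rows transcribed from the table above.
rows = [
    ("GPT-5 (high)", 72, 96),
    ("Grok 4", 69, 88),
    ("Qwen2.5-72B", 65, 85),
    ("Llama 3 70B", 48, 59),
]

# Sort by AIME 2025 accuracy, descending, as the leaderboard does per column.
by_aime_2025 = sorted(rows, key=lambda r: r[2], reverse=True)
for model, score, aime in by_aime_2025:
    print(f"{model}: {aime}%")
```

The same `key=` pattern extends to any other column once the full table is loaded, e.g. from a CSV export.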
About Math Benchmarks
- AIME 2023: High school mathematics competition
- AIME 2024: High school mathematics competition
- AIME 2025: High school mathematics competition
- HMMT Feb 2023: Collegiate mathematics competition
- HMMT Feb 2024: Collegiate mathematics competition
- HMMT Feb 2025: Collegiate mathematics competition
- BRUMO 2025: University-level mathematics olympiad