
American Invitational Mathematics Examination 2023 (AIME 2023)

A 15-question, 3-hour examination in which each answer is an integer from 000 to 999. It serves as the intermediate step between the AMC 10/12 and the USA Mathematical Olympiad (USAMO).

How BenchLM shows AIME 2023 right now

BenchLM is tracking AIME 2023 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

106 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on AIME 2023 — April 20, 2026

BenchLM mirrors the published tracked score view for AIME 2023. GPT-5.1-Codex-Max leads the public snapshot at 99%, followed by GPT-5.2-Codex (99%) and GPT-5.3 Codex (99%). BenchLM does not use these results to rank models overall.

106 models · Math · Stale · Display only · Updated April 20, 2026

The published AIME 2023 snapshot is tightly clustered at the top: GPT-5.1-Codex-Max sits at 99%, and the rest of the top 10 is tied at the same score, for a top-10 spread of 0.0 points. The leading published scores therefore sit in an extremely narrow band.

106 models have been evaluated on AIME 2023. The benchmark falls in the Math category, which carries a 5% weight in BenchLM.ai's overall scoring system. AIME 2023 itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
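To make "excluded from the scoring formula" concrete, here is a minimal Python sketch of an overall score computed as a weighted average over scored benchmarks, with display-only rows skipped. The data shape and names (BenchmarkResult, overall_score) are hypothetical illustrations, not BenchLM's actual implementation.

```python
# Hypothetical sketch: display-only benchmarks contribute nothing to the
# overall score. Names and data shapes are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    name: str
    category_weight: float  # e.g. 0.05 for the Math category
    score: float            # 0.0 - 1.0
    display_only: bool      # True = excluded from the scoring formula

def overall_score(results: list[BenchmarkResult]) -> float:
    """Weighted average over scored benchmarks; display-only rows are skipped."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.category_weight for r in scored)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.category_weight for r in scored) / total_weight

results = [
    BenchmarkResult("AIME 2023", 0.05, 0.99, display_only=True),
    BenchmarkResult("Some scored benchmark", 0.10, 0.80, display_only=False),
]
print(overall_score(results))  # 0.8 -- the AIME 2023 row has no effect
```

Under this sketch, changing the AIME 2023 score changes nothing in the output, which is exactly what display-only status implies.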

About AIME 2023

Year

2023

Tasks

15 problems

Format

Integer answers 000-999

Difficulty

High school olympiad level

AIME is designed for students who score well on the AMC 10/12. Problems require creative problem-solving and mathematical insight beyond the standard high school curriculum. Only the top scorers qualify for the USAMO.
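Because every answer is an integer from 000 to 999, automated grading of AIME-style questions reduces to exact match on a normalized answer. The sketch below assumes zero-padding to three digits as the normalization rule; it is an illustration, not BenchLM's documented grading pipeline.

```python
# Minimal sketch of AIME-style exact-match grading, assuming answers are
# compared as zero-padded three-digit integers (000-999). The normalization
# rule is an assumption, not a documented BenchLM behavior.
def normalize(answer: str) -> str | None:
    """Return the canonical three-digit form, or None if out of range."""
    try:
        value = int(answer.strip())
    except ValueError:
        return None
    if not 0 <= value <= 999:
        return None
    return f"{value:03d}"

def grade(model_answer: str, reference: str) -> bool:
    got = normalize(model_answer)
    return got is not None and got == normalize(reference)

print(grade("73", "073"))    # True: both normalize to "073"
print(grade("1000", "100"))  # False: 1000 is outside 000-999
```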

BenchLM freshness & provenance

Version

AIME 2023

Refresh cadence

Static

Staleness state

Stale

Question availability

Public benchmark set

Stale · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
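As a rough illustration of that triage, the sketch below maps a staleness state to one of the three treatments named above. The state names and the mapping are assumptions; the authoritative rules are on the BenchLM methodology page.

```python
# Illustrative mapping from freshness metadata to benchmark treatment.
# States and mapping are assumptions, not the published policy.
def benchmark_treatment(staleness_state: str) -> str:
    if staleness_state == "fresh":
        return "strong differentiator"  # counted toward overall rankings
    if staleness_state == "aging":
        return "benchmark to watch"     # tracked, weighted with caution
    return "display-only reference"     # shown, excluded from scoring

# AIME 2023 is marked Stale, so it falls into the last bucket:
print(benchmark_treatment("stale"))  # display-only reference
```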

Tracked score table (106 models)

Rank. Model (slug): Tracked score

1. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 99%
2. GPT-5.2-Codex (gpt-5-2-codex): 99%
3. GPT-5.3 Codex (gpt-5-3-codex): 99%
4. GPT-5.4 (gpt-5-4): 99%
5. Grok 4.1 (grok-4-1): 99%
6. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 99%
7. Claude Opus 4.6 (claude-opus-4-6): 99%
8. GPT-5.1 (gpt-5-1): 99%
9. GPT-5.2 (gpt-5-2): 99%
10. Claude Sonnet 4.6 (claude-sonnet-4-6): 99%
11. Gemini 3 Pro (gemini-3-pro): 99%
12. Claude Opus 4.5 (claude-opus-4-5): 99%
13. GPT-5.2 Pro (gpt-5-2-pro): 99%
14. GPT-5.3 Instant (gpt-5-3-instant): 99%
15. GPT-5.2 Instant (gpt-5-2-instant): 99%
16. GLM-5 (Reasoning) (glm-5-reasoning): 98%
17. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 98%
18. Claude Sonnet 4.5 (claude-sonnet-4-5): 97%
19. Grok 4.1 Fast (grok-4-1-fast): 96%
20. GPT-5 (high) (gpt-5-high): 95%
21. 94%
22. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 94%
23. GPT-5 (medium) (gpt-5-medium): 93%
24. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 93%
25. 90%
26. GPT-5 mini (gpt-5-mini): 90%
27. 88%
28. GLM-5 (glm-5): 88%
29. Grok 4 (grok-4): 87%
30. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 87%
31. GLM-4.7 (glm-4-7): 86%
32. Qwen2.5-1M (qwen2-5-1m): 85%
33. Step 3.5 Flash (step-3-5-flash): 85%
34. Gemini 2.5 Pro (gemini-2-5-pro): 84%
35. Qwen2.5-72B (qwen2-5-72b): 84%
36. DeepSeek V3.2 (deepseek-v3-2): 84%
37. Qwen3.5 397B (qwen3-5-397b): 83%
38. o4-mini (high) (o4-mini-high): 83%
39. DeepSeek Coder 2.0 (deepseek-coder-2-0): 81%
40. Mercury 2 (mercury-2): 81%
41. DeepSeekMath V2 (deepseekmath-v2): 80%
42. DeepSeek LLM 2.0 (deepseek-llm-2-0): 80%
43. MiMo-V2-Flash (mimo-v2-flash): 79%
44. Kimi K2.5 (kimi-k2-5): 77%
45. Claude 4.1 Opus (claude-4-1-opus): 76%
46. Mistral Large 3 (mistral-large-3): 76%
47. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 74%
48. Aion-2.0 (aion-2-0): 74%
49. Claude 4 Sonnet (claude-4-sonnet): 73%
50. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 73%
51. MiniMax M2.5 (minimax-m2-5): 73%
52. Seed 1.6 (seed-1-6): 72%
53. Seed-2.0-Lite (seed-2-0-lite): 71%
54. Gemini 3 Flash (gemini-3-flash): 70%
55. Llama 3.1 405B (llama-3-1-405b): 70%
56. Claude Haiku 4.5 (claude-haiku-4-5): 68%
57. Mistral Large 2 (mistral-large-2): 68%
58. Ministral 3 14B (ministral-3-14b): 68%
59. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 67%
60. GPT-4o (gpt-4o): 66%
61. GLM-4.7-Flash (glm-4-7-flash): 66%
62. Nemotron 3 Super 100B (nemotron-3-super-100b): 65%
63. Claude 3.5 Sonnet (claude-3-5-sonnet): 65%
64. Mistral 8x7B (mistral-8x7b): 65%
65. Grok Code Fast 1 (grok-code-fast-1): 64%
66. Gemini 1.5 Pro (gemini-1-5-pro): 64%
67. Seed 1.6 Flash (seed-1-6-flash): 64%
68. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 63%
69. Gemini 1.0 Pro (gemini-1-0-pro): 62%
70. Seed-2.0-Mini (seed-2-0-mini): 62%
71. Claude 3 Opus (claude-3-opus): 61%
72. GPT-4 Turbo (gpt-4-turbo): 60%
73. Llama 3 70B (llama-3-70b): 58%
74. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 57%
75. Claude 3 Haiku (claude-3-haiku): 56%
76. Nemotron-4 15B (nemotron-4-15b): 54%
77. Moonshot v1 (moonshot-v1): 53%
78. Z-1 (z-1): 52%
79. GPT-OSS 120B (gpt-oss-120b): 51%
80. Gemini 2.5 Flash (gemini-2-5-flash): 50%
81. Nemotron Ultra 253B (nemotron-ultra-253b): 49%
82. Llama 4 Behemoth (llama-4-behemoth): 48%
83. Llama 4 Scout (llama-4-scout): 47%
84. Llama 4 Maverick (llama-4-maverick): 46%
85. LFM2-24B-A2B (lfm2-24b-a2b): 46%
86. Gemma 3 27B (gemma-3-27b): 45%
87. DeepSeek-R1 (deepseek-r1): 44%
88. Grok 3 [Beta] (grok-3-beta): 42%
89. Nova Pro (nova-pro): 41%
90. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 40%
91. Qwen3 235B 2507 (qwen3-235b-2507): 39%
92. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 38%
93. GLM-4.5 (glm-4-5): 37%
94. MiniMax M1 80k (minimax-m1-80k): 36%
95. GLM-4.5-Air (glm-4-5-air): 35%
96. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 34%
97. DeepSeek V3.1 (deepseek-v3-1): 33%
98. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 33%
99. GPT-OSS 20B (gpt-oss-20b): 31%
100. Mistral 7B v0.3 (mistral-7b-v0-3): 30%
101. Ministral 3 8B (ministral-3-8b): 30%
102. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 29%
103. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 28%
104. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 27%
105. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 24%
106. Ministral 3 3B (ministral-3-3b): 23%

FAQ

What does AIME 2023 measure?

AIME 2023 is a 15-question, 3-hour examination in which each answer is an integer from 000 to 999. It serves as the intermediate step between the AMC 10/12 and the USA Mathematical Olympiad (USAMO).

Which model leads the published AIME 2023 snapshot?

GPT-5.1-Codex-Max currently leads the published AIME 2023 snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on AIME 2023?

106 AI models are included in BenchLM's mirrored AIME 2023 snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard
