
American Invitational Mathematics Examination 2023 (AIME 2023)

A 15-question, 3-hour examination in which each answer is an integer from 000 to 999. It serves as the intermediate step between the AMC 10/12 and the USA Mathematical Olympiad (USAMO).

How BenchLM shows AIME 2023 right now

BenchLM is tracking AIME 2023 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

106 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on AIME 2023 — April 20, 2026

BenchLM mirrors the published tracked score view for AIME 2023. GPT-5.1-Codex-Max leads the public snapshot at 99%, followed by GPT-5.2-Codex (99%) and GPT-5.3 Codex (99%). BenchLM does not use these results to rank models overall.

106 models · Math · Stale · Display only · Updated April 20, 2026

The published AIME 2023 snapshot is tightly clustered at the top: GPT-5.1-Codex-Max sits at 99%, and the rest of the top 10 is tied at the same score, for a top-10 spread of 0.0 points. The leading published scores therefore sit in an extremely narrow band.

106 models have been evaluated on AIME 2023. The benchmark falls in the Math category, which carries a 5% weight in BenchLM.ai's overall scoring system. AIME 2023 itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
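To make "excluded from the scoring formula" concrete, here is a minimal Python sketch of an overall score computed as a weighted average over scored benchmarks, with display-only rows skipped. The data shape and names (BenchmarkResult, overall_score) are hypothetical illustrations, not BenchLM's actual implementation.

```python
# Hypothetical sketch: display-only benchmarks contribute nothing to the
# overall score. Names and data shapes are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    name: str
    category_weight: float  # e.g. 0.05 for the Math category
    score: float            # 0.0 - 1.0
    display_only: bool      # True = excluded from the scoring formula

def overall_score(results: list[BenchmarkResult]) -> float:
    """Weighted average over scored benchmarks; display-only rows are skipped."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.category_weight for r in scored)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.category_weight for r in scored) / total_weight

results = [
    BenchmarkResult("AIME 2023", 0.05, 0.99, display_only=True),
    BenchmarkResult("Some scored benchmark", 0.10, 0.80, display_only=False),
]
print(overall_score(results))  # 0.8 -- the AIME 2023 row has no effect
```

Under this sketch, changing the AIME 2023 score changes nothing in the output, which is exactly what display-only status implies.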

About AIME 2023

Year

2023

Tasks

15 problems

Format

Integer answers 000-999

Difficulty

High school olympiad level

AIME is designed for students who score well on the AMC 10/12. Problems require creative problem-solving and mathematical insight beyond the standard high school curriculum. Only the top scorers qualify for the USAMO.
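Because every answer is an integer from 000 to 999, automated grading of AIME-style questions reduces to exact match on a normalized answer. The sketch below assumes zero-padding to three digits as the normalization rule; it is an illustration, not BenchLM's documented grading pipeline.

```python
# Minimal sketch of AIME-style exact-match grading, assuming answers are
# compared as zero-padded three-digit integers (000-999). The normalization
# rule is an assumption, not a documented BenchLM behavior.
def normalize(answer: str) -> str | None:
    """Return the canonical three-digit form, or None if out of range."""
    try:
        value = int(answer.strip())
    except ValueError:
        return None
    if not 0 <= value <= 999:
        return None
    return f"{value:03d}"

def grade(model_answer: str, reference: str) -> bool:
    got = normalize(model_answer)
    return got is not None and got == normalize(reference)

print(grade("73", "073"))    # True: both normalize to "073"
print(grade("1000", "100"))  # False: 1000 is outside 000-999
```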

BenchLM freshness & provenance

Version

AIME 2023

Refresh cadence

Static

Staleness state

Stale

Question availability

Public benchmark set

Stale · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
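As a rough illustration of that triage, the sketch below maps a staleness state to one of the three treatments named above. The state names and the mapping are assumptions; the authoritative rules are on the BenchLM methodology page.

```python
# Illustrative mapping from freshness metadata to benchmark treatment.
# States and mapping are assumptions, not the published policy.
def benchmark_treatment(staleness_state: str) -> str:
    if staleness_state == "fresh":
        return "strong differentiator"  # counted toward overall rankings
    if staleness_state == "aging":
        return "benchmark to watch"     # tracked, weighted with caution
    return "display-only reference"     # shown, excluded from scoring

# AIME 2023 is marked Stale, so it falls into the last bucket:
print(benchmark_treatment("stale"))  # display-only reference
```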

Tracked score table (106 models)

Rank. Model (slug): Tracked score

1. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 99%
2. GPT-5.2-Codex (gpt-5-2-codex): 99%
3. GPT-5.3 Codex (gpt-5-3-codex): 99%
4. GPT-5.4 (gpt-5-4): 99%
5. Grok 4.1 (grok-4-1): 99%
6. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 99%
7. Claude Opus 4.6 (claude-opus-4-6): 99%
8. GPT-5.1 (gpt-5-1): 99%
9. GPT-5.2 (gpt-5-2): 99%
10. Claude Sonnet 4.6 (claude-sonnet-4-6): 99%
11. Gemini 3 Pro (gemini-3-pro): 99%
12. Claude Opus 4.5 (claude-opus-4-5): 99%
13. GPT-5.2 Pro (gpt-5-2-pro): 99%
14. GPT-5.3 Instant (gpt-5-3-instant): 99%
15. GPT-5.2 Instant (gpt-5-2-instant): 99%
16. GLM-5 (Reasoning) (glm-5-reasoning): 98%
17. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 98%
18. Claude Sonnet 4.5 (claude-sonnet-4-5): 97%
19. Grok 4.1 Fast (grok-4-1-fast): 96%
20. GPT-5 (high) (gpt-5-high): 95%
21. 94%
22. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 94%
23. GPT-5 (medium) (gpt-5-medium): 93%
24. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 93%
25. 90%
26. GPT-5 mini (gpt-5-mini): 90%
27. 88%
28. GLM-5 (glm-5): 88%
29. Grok 4 (grok-4): 87%
30. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 87%
31. GLM-4.7 (glm-4-7): 86%
32. Qwen2.5-1M (qwen2-5-1m): 85%
33. Step 3.5 Flash (step-3-5-flash): 85%
34. Gemini 2.5 Pro (gemini-2-5-pro): 84%
35. Qwen2.5-72B (qwen2-5-72b): 84%
36. DeepSeek V3.2 (deepseek-v3-2): 84%
37. Qwen3.5 397B (qwen3-5-397b): 83%
38. o4-mini (high) (o4-mini-high): 83%
39. DeepSeek Coder 2.0 (deepseek-coder-2-0): 81%
40. Mercury 2 (mercury-2): 81%
41. DeepSeekMath V2 (deepseekmath-v2): 80%
42. DeepSeek LLM 2.0 (deepseek-llm-2-0): 80%
43. MiMo-V2-Flash (mimo-v2-flash): 79%
44. Kimi K2.5 (kimi-k2-5): 77%
45. Claude 4.1 Opus (claude-4-1-opus): 76%
46. Mistral Large 3 (mistral-large-3): 76%
47. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 74%
48. Aion-2.0 (aion-2-0): 74%
49. Claude 4 Sonnet (claude-4-sonnet): 73%
50. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 73%
51. MiniMax M2.5 (minimax-m2-5): 73%
52. Seed 1.6 (seed-1-6): 72%
53. Seed-2.0-Lite (seed-2-0-lite): 71%
54. Gemini 3 Flash (gemini-3-flash): 70%
55. Llama 3.1 405B (llama-3-1-405b): 70%
56. Claude Haiku 4.5 (claude-haiku-4-5): 68%
57. Mistral Large 2 (mistral-large-2): 68%
58. Ministral 3 14B (ministral-3-14b): 68%
59. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 67%
60. GPT-4o (gpt-4o): 66%
61. GLM-4.7-Flash (glm-4-7-flash): 66%
62. Nemotron 3 Super 100B (nemotron-3-super-100b): 65%
63. Claude 3.5 Sonnet (claude-3-5-sonnet): 65%
64. Mistral 8x7B (mistral-8x7b): 65%
65. Grok Code Fast 1 (grok-code-fast-1): 64%
66. Gemini 1.5 Pro (gemini-1-5-pro): 64%
67. Seed 1.6 Flash (seed-1-6-flash): 64%
68. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 63%
69. Gemini 1.0 Pro (gemini-1-0-pro): 62%
70. Seed-2.0-Mini (seed-2-0-mini): 62%
71. Claude 3 Opus (claude-3-opus): 61%
72. GPT-4 Turbo (gpt-4-turbo): 60%
73. Llama 3 70B (llama-3-70b): 58%
74. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 57%
75. Claude 3 Haiku (claude-3-haiku): 56%
76. Nemotron-4 15B (nemotron-4-15b): 54%
77. Moonshot v1 (moonshot-v1): 53%
78. Z-1 (z-1): 52%
79. GPT-OSS 120B (gpt-oss-120b): 51%
80. Gemini 2.5 Flash (gemini-2-5-flash): 50%
81. Nemotron Ultra 253B (nemotron-ultra-253b): 49%
82. Llama 4 Behemoth (llama-4-behemoth): 48%
83. Llama 4 Scout (llama-4-scout): 47%
84. Llama 4 Maverick (llama-4-maverick): 46%
85. LFM2-24B-A2B (lfm2-24b-a2b): 46%
86. Gemma 3 27B (gemma-3-27b): 45%
87. DeepSeek-R1 (deepseek-r1): 44%
88. Grok 3 [Beta] (grok-3-beta): 42%
89. Nova Pro (nova-pro): 41%
90. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 40%
91. Qwen3 235B 2507 (qwen3-235b-2507): 39%
92. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 38%
93. GLM-4.5 (glm-4-5): 37%
94. MiniMax M1 80k (minimax-m1-80k): 36%
95. GLM-4.5-Air (glm-4-5-air): 35%
96. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 34%
97. DeepSeek V3.1 (deepseek-v3-1): 33%
98. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 33%
99. GPT-OSS 20B (gpt-oss-20b): 31%
100. Mistral 7B v0.3 (mistral-7b-v0-3): 30%
101. Ministral 3 8B (ministral-3-8b): 30%
102. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 29%
103. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 28%
104. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 27%
105. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 24%
106. Ministral 3 3B (ministral-3-3b): 23%

FAQ

What does AIME 2023 measure?

AIME 2023 is a 15-question, 3-hour examination in which each answer is an integer from 000 to 999. It serves as the intermediate step between the AMC 10/12 and the USA Mathematical Olympiad (USAMO).

Which model leads the published AIME 2023 snapshot?

GPT-5.1-Codex-Max currently leads the published AIME 2023 snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on AIME 2023?

106 AI models are included in BenchLM's mirrored AIME 2023 snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard
