Artificial Analysis Humanity's Last Exam (AA-HLE)

Name: Artificial Analysis Humanity's Last Exam
Creator: BenchLM

AA-HLE is a display-only mirror of the Artificial Analysis Humanity's Last Exam score.

Benchmark score on AA-HLE — July 4, 2026

BenchLM mirrors the published score view for AA-HLE. Claude Opus 4.8 leads the public snapshot at 45.7%, followed by Gemini 3.1 Pro (44.7%) and GPT-5.5 (44.3%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.8 (Anthropic, Closed): 45.7% · Overall 85 · Context 1M

2. Gemini 3.1 Pro (Google, Closed): 44.7% · Overall 88 · Context 1M

3. GPT-5.5 (OpenAI, Closed): 44.3% · Overall 78 · Context 1M

132 models · Knowledge · Current · Display only · Updated July 4, 2026

The published AA-HLE snapshot is tightly clustered at the top: Claude Opus 4.8 sits at 45.7%, while the third row is only 1.4 points behind. The broader top-10 spread is 7.6 points, so many of the published scores sit in a relatively narrow band.
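The two spread figures above can be checked directly from the published scores. The snippet below is a minimal sketch: the `spread` helper is a hypothetical name introduced here for illustration, and the list holds the ten highest scores from the score table on this page.

```python
# Ten highest AA-HLE scores from the published snapshot (percent accuracy).
top10 = [45.7, 44.7, 44.3, 41.6, 41.0, 40.1, 39.9, 39.9, 39.6, 38.1]


def spread(scores, n):
    """Gap in percentage points between the best and the n-th best score."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to avoid floating-point noise in the difference.
    return round(ranked[0] - ranked[n - 1], 1)


top3_gap = spread(top10, 3)      # leader vs. third row
top10_spread = spread(top10, 10)  # leader vs. tenth row
print(top3_gap, top10_spread)     # prints: 1.4 7.6
```

A small gap relative to the overall range suggests that differences between adjacent top rows should be read with care.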

132 models have been evaluated on AA-HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-HLE is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
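To illustrate what "displayed but excluded from the scoring formula" could mean in practice, here is a rough sketch. It is not BenchLM's actual formula: only the 12% Knowledge weight comes from the text above, while the `Benchmark` class, the `overall_score` function, the "Other" category, and the two extra benchmark entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float                # percent, 0-100
    display_only: bool = False  # shown on the page but never scored


# Assumed weights: only the 12% Knowledge weight is stated in the text.
CATEGORY_WEIGHTS = {"Knowledge": 0.12, "Other": 0.88}


def overall_score(results):
    """Weighted sum of per-category averages, skipping display-only entries."""
    total = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        scored = [
            b.score
            for b in results
            if b.category == category and not b.display_only
        ]
        if scored:  # simplification: empty categories contribute nothing
            total += weight * sum(scored) / len(scored)
    return round(total, 2)


results = [
    Benchmark("AA-HLE", "Knowledge", 45.7, display_only=True),
    Benchmark("hypothetical-knowledge-bench", "Knowledge", 80.0),
    Benchmark("hypothetical-other-bench", "Other", 90.0),
]
baseline = overall_score(results)

# Changing the display-only score leaves the overall score untouched.
results[0].score = 0.0
unchanged = overall_score(results) == baseline
```

Under these assumptions, a refresh of the AA-HLE value changes what the page shows but not the overall ranking input.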

About AA-HLE

Year: 2026
Tasks: Expert-level questions
Format: Accuracy
Difficulty: Frontier expert reasoning

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

Artificial Analysis Humanity's Last Exam Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-HLE 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (132 models)

| Rank | Model | Provider | Open/Closed | Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | Closed | 45.7% |
| 2 | Gemini 3.1 Pro | Google | Closed | 44.7% |
| 3 | GPT-5.5 | OpenAI | Closed | 44.3% |
| 4 | GPT-5.4 | OpenAI | Closed | 41.6% |
| 5 | Gemini 3.5 Flash | Google | Closed | 41.0% |
| 6 | GLM-5.2 | Z.AI | Open | 40.1% |
| 7 | GPT-5.3 Codex | OpenAI | Closed | 39.9% |
| 8 | Muse Spark | Meta | Closed | 39.9% |
| 9 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 39.6% |
| 10 | Qwen3.7 Max | Alibaba | Closed | 38.1% |
| 11 | Gemini 3 Pro | Google | Closed | 37.2% |
| 12 | MiniMax M3 | MiniMax | Open | 37.1% |
| 13 | Claude Opus 4.6 (Adaptive) | Anthropic | Closed | 36.7% |
| 14 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 35.9% |
| 15 | Kimi K2.6 | Moonshot AI | Open | 35.9% |
| 16 | GPT-5.2 | OpenAI | Closed | 35.4% |
| 17 | Grok 4.3 | xAI | Closed | 35.0% |
| 18 | MiMo-V2.5-Pro | Xiaomi | Closed | 33.8% |
| 19 | DeepSeek V4 Pro (High) | DeepSeek | Open | 33.5% |
| 20 | GPT-5.2-Codex | OpenAI | Closed | 33.5% |
| 21 | Qwen3.7 Plus | Alibaba | Closed | 33.4% |
| 22 | Kimi K2.7 Code | Moonshot AI | Open | 32.8% |
| 23 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 32.1% |
| 24 | Claude Opus 4.7 | Anthropic | Closed | 31.2% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 29.4% |
| 26 | Kimi K2.5 | Moonshot AI | Open | 29.4% |
| 27 | Qwen 3.6 Max (preview) | Alibaba | Closed | 28.9% |
| 28 | Claude Opus 4.5 Thinking | Anthropic | Closed | 28.4% |
| 29 | MiMo-V2-Pro | Xiaomi | Closed | 28.3% |
| 30 | MiniMax M2.7 | MiniMax | Open | 28.1% |
| 31 | GLM-5.1 | Z.AI | Open | 28.0% |
| 32 | DeepSeek V4 Flash (High) | DeepSeek | Open | 27.8% |
| 33 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 27.3% |
| 34 | GLM-5 | Z.AI | Open | 27.2% |
| 35 | Nemotron 3 Ultra | NVIDIA | Open | 26.6% |
| 36 | GPT-5.4 mini | OpenAI | Closed | 26.6% |
| 37 | GPT-5.1 | OpenAI | Closed | 26.5% |
| 38 | GPT-5 (high) | OpenAI | Closed | 26.5% |
| 39 | GPT-5.4 nano | OpenAI | Closed | 26.5% |
| 40 | Qwen3.6 Plus | Alibaba | Closed | 25.7% |
| 41 | Hy3 Preview | Tencent | Open | 25.5% |
| 42 | GLM-5-Turbo | Z.AI | Closed | 25.4% |
| 43 | GLM-4.7 | Z.AI | Open | 25.1% |
| 44 | Grok 4 | xAI | Closed | 23.9% |
| 45 | GPT-5 (medium) | OpenAI | Closed | 23.5% |
| 46 | GPT-5.1-Codex-Max | OpenAI | Closed | 23.4% |
| 47 | Qwen3.5-122B-A10B | Alibaba | Open | 23.4% |
| 48 | GPT-5.1-Codex | OpenAI | Closed | 23.4% |
| 49 | Gemma 4 31B | Google | Open | 22.7% |
| 50 | Qwen3.5-27B | Alibaba | Open | 22.2% |
| 51 | Qwen3.6-27B | Alibaba | Open | 21.6% |
| 52 | Gemini 2.5 Pro | Google | Closed | 21.1% |
| 53 | Qwen3.6-35B-A3B | Alibaba | Open | 20.2% |
| 54 | [name missing] | OpenAI | Closed | 20.0% |
| 55 | MiMo-V2-Omni | Xiaomi | Closed | 19.9% |
| 56 | Step 3.7 Flash | StepFun | Open | 19.9% |
| 57 | Qwen3.5-35B-A3B | Alibaba | Open | 19.7% |
| 58 | Qwen3.5 397B | Alibaba | Open | 18.8% |
| 59 | Claude Opus 4.6 | Anthropic | Closed | 18.6% |
| 60 | GPT-OSS 120B | OpenAI | Open | 18.5% |
| 61 | Gemma 4 26B A4B | Google | Open | 18.3% |
| 62 | Grok 4.1 Fast (Reasoning) | xAI | Closed | 17.6% |
| 63 | Grok 4 Fast (Reasoning) | xAI | Closed | 17.0% |
| 64 | Gemini 3.1 Flash-Lite | Google | Closed | 16.2% |
| 65 | GLM-5V-Turbo | Z.AI | Closed | 15.8% |
| 66 | DeepSeek-R1 | DeepSeek | Open | 14.9% |
| 67 | Gemma 4 12B | Google | Open | 14.8% |
| 68 | Trinity-Large-Thinking | Arcee AI | Open | 14.7% |
| 69 | Trinity-Large-Preview | Arcee AI | Open | 14.7% |
| 70 | Gemini 3 Flash | Google | Closed | 14.1% |
| 71 | Claude Sonnet 4.6 | Anthropic | Closed | 13.2% |
| 72 | K-Exaone | LG AI Research | Closed | 13.1% |
| 73 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open | 13.0% |
| 74 | Claude Opus 4.5 | Anthropic | Closed | 12.9% |
| 75 | Mistral Medium 3.5 128B | Mistral | Open | 12.8% |
| 76 | Claude 4.1 Opus Thinking | Anthropic | Closed | 11.9% |
| 77 | Command A+ | Cohere | Open | 11.4% |
| 78 | Qwen3 Max | Alibaba | Closed | 11.1% |
| 79 | DeepSeek V3.2 | DeepSeek | Open | 10.5% |
| 80 | Sarvam 105B | Sarvam | Open | 10.1% |
| 81 | GPT-OSS 20B | OpenAI | Open | 9.8% |
| 82 | Mistral Small 4 (Reasoning) | Mistral | Open | 9.5% |
| 83 | Mistral Small 4 | Mistral | Open | 9.5% |
| 84 | o3-mini | OpenAI | Closed | 8.7% |
| 85 | Nemotron Ultra 253B | NVIDIA | Open | 8.1% |
| 86 | MiMo-V2-Flash | Xiaomi | Open | 8.0% |
| 87 | [name missing] | OpenAI | Closed | 7.7% |
| 88 | Grok Code Fast 1 | xAI | Closed | 7.5% |
| 89 | Kimi K2 | Moonshot AI | Closed | 7.0% |
| 90 | Sarvam 30B | Sarvam | Open | 7.0% |
| 91 | LFM2.5-8B-A1B | LiquidAI | Open | 6.9% |
| 92 | GLM-4.5-Air | Z.AI | Closed | 6.8% |
| 93 | Granite-4.0-H-350M | IBM | Open | 6.4% |
| 94 | DeepSeek V3.1 | DeepSeek | Open | 6.3% |
| 95 | Ling 2.6 Flash | InclusionAI | Open | 6.2% |
| 96 | Exaone 4.0 1.2B | LG AI Research | Open | 5.8% |
| 97 | Granite-4.0-350M | IBM | Open | 5.7% |
| 98 | DeepSeek R1 Distill Qwen 32B | DeepSeek | Open | 5.5% |
| 99 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 5.3% |
| 100 | GLM-4.6 | Z.AI | Open | 5.2% |
| 101 | Gemini 2.5 Flash | Google | Closed | 5.1% |
| 102 | Granite-4.0-1B | IBM | Open | 5.1% |
| 103 | LFM2.5-VL-1.6B-Extract | LiquidAI | Open | 5.1% |
| 104 | Grok 4.1 Fast | xAI | Closed | 5.0% |
| 105 | Granite-4.0-H-1B | IBM | Open | 5.0% |
| 106 | Gemini 1.5 Pro | Google | Closed | 4.9% |
| 107 | Exaone 4.0 32B | LG AI Research | Open | 4.9% |
| 108 | Llama 4 Maverick | Meta | Open | 4.8% |
| 109 | Gemma 4 E2B | Google | Open | 4.8% |
| 110 | Gemma 3 27B | Google | Open | 4.7% |
| 111 | GPT-4.1 | OpenAI | Closed | 4.6% |
| 112 | GPT-4.1 mini | OpenAI | Closed | 4.6% |
| 113 | Nemotron 3 Nano 30B | NVIDIA | Open | 4.6% |
| 114 | Gemini 1.0 Pro | Google | Closed | 4.6% |
| 115 | Llama 4 Scout | Meta | Open | 4.3% |
| 116 | Mistral Medium 3 | Mistral | Closed | 4.3% |
| 117 | Llama 3.1 405B | Meta | Open | 4.2% |
| 118 | Mistral Large 3 | Mistral | Closed | 4.1% |
| 119 | Phi-4 | Microsoft | Open | 4.1% |
| 120 | Claude 4 Sonnet | Anthropic | Closed | 4.0% |
| 121 | GPT-4o mini | OpenAI | Closed | 4.0% |
| 122 | Mistral Large 2 | Mistral | Closed | 4.0% |
| 123 | GPT-4.1 nano | OpenAI | Closed | 3.9% |
| 124 | Claude 3 Haiku | Anthropic | Closed | 3.9% |
| 125 | Qwen2.5 Coder 32B Instruct | Alibaba | Open | 3.8% |
| 126 | Solar Pro 2 | Upstage | Closed | 3.8% |
| 127 | Gemma 4 E4B | Google | Open | 3.7% |
| 128 | DeepSeek V3 | DeepSeek | Open | 3.6% |
| 129 | Nova Pro | Amazon | Closed | 3.4% |
| 130 | GPT-4o | OpenAI | Closed | 3.3% |
| 131 | GPT-4 Turbo | OpenAI | Closed | 3.3% |
| 132 | Claude 3 Opus | Anthropic | Closed | 3.1% |

FAQ

What does AA-HLE measure?

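AA-HLE measures accuracy on expert-level questions that call for frontier expert reasoning. BenchLM shows it as a display-only Artificial Analysis Humanity's Last Exam score.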

Which model scores highest on AA-HLE?

Claude Opus 4.8 by Anthropic currently leads with a score of 45.7% on AA-HLE.

How many models are evaluated on AA-HLE?

BenchLM lists AA-HLE scores for 132 AI models.

Compare Top Models on AA-HLE

Claude Opus 4.8 vs Gemini 3.1 Pro · Gemini 3.1 Pro vs GPT-5.5 · GPT-5.5 vs GPT-5.4 · GPT-5.4 vs Gemini 3.5 Flash

Last updated: July 4, 2026 · BenchLM version AA-HLE 2026
