
Artificial Analysis Humanity's Last Exam (AA-HLE)

A display-only Artificial Analysis Humanity's Last Exam score.

Benchmark score on AA-HLE — May 21, 2026

BenchLM mirrors the published score view for AA-HLE. Gemini 3.1 Pro leads the public snapshot at 44.7%, followed by GPT-5.5 (44.3%) and GPT-5.4 (41.6%). BenchLM does not use these results to rank models overall.

125 models · Knowledge · Current · Display only · Updated May 21, 2026

The published AA-HLE snapshot is tightly clustered at the top: Gemini 3.1 Pro sits at 44.7%, while the third row is only 3.1 points behind. The broader top-10 spread is 8.0 points, so many of the published scores sit in a relatively narrow band.
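The two gaps quoted above can be checked directly against the published top-10 scores. The following is a minimal sketch; the helper name `gap` is illustrative and not part of any BenchLM tooling.

```python
# Top-10 AA-HLE scores (percent), copied from the published score table.
top10 = [44.7, 44.3, 41.6, 41.0, 39.9, 39.9, 39.6, 38.1, 37.2, 36.7]


def gap(scores, i, j):
    """Percentage-point difference between ranks i and j (1-indexed)."""
    # Round to one decimal to avoid floating-point noise in the subtraction.
    return round(scores[i - 1] - scores[j - 1], 1)


lead_over_third = gap(top10, 1, 3)   # first row vs. third row
top10_spread = gap(top10, 1, 10)     # first row vs. tenth row

print(lead_over_third, top10_spread)
```

Rounding to one decimal matches the precision of the published scores.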

125 models have been evaluated on AA-HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-HLE is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
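To illustrate what "displayed but excluded" means in practice, here is a rough sketch of a weighted aggregate that skips display-only entries. BenchLM's actual formula is not given on this page, so the weighted mean, the `Benchmark` class, the `overall_score` function, and every entry other than the AA-HLE score and the 12% Knowledge weight are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    score: float               # percent, 0-100
    weight: float              # weight applied in the aggregate (assumed)
    display_only: bool = False


def overall_score(benchmarks):
    """Weighted mean over benchmarks that are not flagged display-only."""
    scored = [b for b in benchmarks if not b.display_only]
    total_weight = sum(b.weight for b in scored)
    if total_weight == 0:
        raise ValueError("no scored benchmarks to aggregate")
    return sum(b.score * b.weight for b in scored) / total_weight


# Hypothetical scored entries; names, scores, and weights are placeholders.
scored_only = [
    Benchmark("hypothetical-a", 80.0, 0.12),
    Benchmark("hypothetical-b", 60.0, 0.20),
]

# AA-HLE is shown with its real top score but flagged display-only.
with_aa_hle = scored_only + [Benchmark("AA-HLE", 44.7, 0.12, display_only=True)]

# Adding the display-only entry leaves the aggregate unchanged.
print(overall_score(scored_only) == overall_score(with_aa_hle))
```

Under these assumptions, the AA-HLE score is visible alongside the other results but never enters the sum.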

About AA-HLE

Year: 2026

Tasks: Expert-level questions

Format: Accuracy

Difficulty: Frontier expert reasoning

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-HLE 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (125 models)

Rank  Score
1     44.7%
2     44.3%
3     41.6%
4     41.0%
5     39.9%
6     39.9%
7     39.6%
8     38.1%
9     37.2%
10    36.7%
11    35.9%
12    35.9%
13    35.4%
14    35.0%
15    33.8%
16    33.5%
17    33.5%
18    32.1%
19    31.2%
20    29.4%
21    29.4%
22    28.9%
23    28.4%
24    28.3%
25    28.1%
26    28.0%
27    27.8%
28    27.3%
29    27.2%
30    27.2%
31    26.6%
32    26.5%
33    26.5%
34    26.5%
35    25.7%
36    25.5%
37    25.4%
38    25.1%
39    24.2%
40    23.9%
41    23.5%
42    23.4%
43    23.4%
44    23.4%
45    22.7%
46    22.2%
47    21.6%
48    21.1%
49    20.2%
50    20.0%
51    19.9%
52    19.7%
53    18.8%
54    18.6%
55    18.5%
56    18.3%
58    17.0%
59    16.2%
60    15.8%
61    14.9%
62    14.7%
63    14.7%
64    14.1%
65    13.2%
66    13.1%
67    13.0%
68    12.9%
69    12.8%
70    11.9%
71    11.4%
72    11.1%
73    11.1%
74    10.5%
75    10.1%
76    9.8%
78    9.5%
79    8.7%
80    8.1%
81    8.0%
82    7.7%
83    7.5%
84    7.0%
85    7.0%
86    6.8%
87    6.4%
88    6.3%
89    6.2%
90    5.8%
91    5.7%
94    5.2%
95    5.1%
96    5.1%
97    5.0%
98    5.0%
99    4.9%
100   4.9%
101   4.8%
102   4.8%
103   4.7%
104   4.6%
105   4.6%
106   4.6%
107   4.6%
108   4.3%
109   4.3%
110   4.2%
111   4.1%
112   4.1%
113   4.0%
114   4.0%
115   4.0%
116   3.9%
117   3.9%
118   3.8%
119   3.8%
120   3.7%
121   3.6%
122   3.4%
123   3.3%
124   3.3%
125   3.1%

FAQ

What does AA-HLE measure?

A display-only Artificial Analysis Humanity's Last Exam score.

Which model scores highest on AA-HLE?

Gemini 3.1 Pro by Google currently leads with a score of 44.7% on AA-HLE.

How many models are evaluated on AA-HLE?

125 AI models have been evaluated on AA-HLE on BenchLM.

Last updated: May 21, 2026 · BenchLM version AA-HLE 2026
