Artificial Analysis Humanity's Last Exam (AA-HLE)

A display-only Artificial Analysis Humanity's Last Exam score.

Benchmark score on AA-HLE — July 4, 2026

BenchLM mirrors the published score view for AA-HLE. Claude Opus 4.8 leads the public snapshot at 45.7%, followed by Gemini 3.1 Pro (44.7%) and GPT-5.5 (44.3%). BenchLM does not use these results to rank models overall.

132 models · Knowledge · Current · Display only · Updated July 4, 2026

The published AA-HLE snapshot is tightly clustered at the top: Claude Opus 4.8 sits at 45.7%, while the third row is only 1.4 points behind. The broader top-10 spread is 7.6 points, so many of the published scores sit in a relatively narrow band.
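The two gaps quoted above can be checked directly against the first ten rows of the score table. The following is a minimal sketch; the variable names are illustrative and not part of any BenchLM tooling.

```python
# Top-10 AA-HLE scores (percent), copied from the score table on this page.
top10 = [45.7, 44.7, 44.3, 41.6, 41.0, 40.1, 39.9, 39.9, 39.6, 38.1]

# Gap between the leader and the third row, rounded to one decimal
# to avoid floating-point noise.
gap_to_third = round(top10[0] - top10[2], 1)

# Spread across the whole top 10 (first row minus tenth row).
top10_spread = round(top10[0] - top10[-1], 1)

print(f"Leader vs. third row: {gap_to_third} points")
print(f"Top-10 spread: {top10_spread} points")
```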

132 models have been evaluated on AA-HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-HLE is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
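As a rough illustration of what "display only" means for scoring, the sketch below filters display-only results out before applying the category weight. Only the 12% Knowledge weight and the 45.7% AA-HLE score come from this page. The `BenchmarkResult` type, the `category_contribution` function, the simple within-category average, and the 40.0 placeholder score for the weighted HLE lane are all assumptions made for the example, not BenchLM's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    category: str
    score: float        # percent accuracy, 0-100
    display_only: bool  # True: shown on the page, excluded from scoring


# Only the 12% Knowledge weight is taken from the page.
CATEGORY_WEIGHTS = {"Knowledge": 0.12}


def category_contribution(results, category):
    """Weighted contribution of one category, ignoring display-only rows.

    Assumption: benchmarks within a category are averaged with equal weight.
    """
    scored = [
        r.score
        for r in results
        if r.category == category and not r.display_only
    ]
    if not scored:
        return 0.0
    return CATEGORY_WEIGHTS[category] * sum(scored) / len(scored)


results = [
    # Hypothetical placeholder score for the weighted HLE lane.
    BenchmarkResult("HLE", "Knowledge", 40.0, display_only=False),
    # Published AA-HLE leader score; flagged display-only, so it is skipped.
    BenchmarkResult("AA-HLE", "Knowledge", 45.7, display_only=True),
]

contribution = category_contribution(results, "Knowledge")
print(f"Knowledge contribution: {contribution:.2f}")
```

Under these assumptions, changing the AA-HLE value leaves the contribution unchanged, which is the behavior the paragraph above describes.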

About AA-HLE

Year: 2026

Tasks: Expert-level questions

Format: Accuracy

Difficulty: Frontier expert reasoning

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-HLE 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (132 models)

Rank  Score
1     45.7%
2     44.7%
3     44.3%
4     41.6%
5     41.0%
6     40.1%
7     39.9%
8     39.9%
9     39.6%
10    38.1%
11    37.2%
12    37.1%
13    36.7%
14    35.9%
15    35.9%
16    35.4%
17    35.0%
18    33.8%
19    33.5%
20    33.5%
21    33.4%
22    32.8%
23    32.1%
24    31.2%
25    29.4%
26    29.4%
27    28.9%
28    28.4%
29    28.3%
30    28.1%
31    28.0%
32    27.8%
33    27.3%
34    27.2%
35    26.6%
36    26.6%
37    26.5%
38    26.5%
39    26.5%
40    25.7%
41    25.5%
42    25.4%
43    25.1%
44    23.9%
45    23.5%
46    23.4%
47    23.4%
48    23.4%
49    22.7%
50    22.2%
51    21.6%
52    21.1%
53    20.2%
54    20.0%
55    19.9%
56    19.9%
57    19.7%
58    18.8%
59    18.6%
60    18.5%
61    18.3%
63    17.0%
64    16.2%
65    15.8%
66    14.9%
67    14.8%
68    14.7%
69    14.7%
70    14.1%
71    13.2%
72    13.1%
73    13.0%
74    12.9%
75    12.8%
76    11.9%
77    11.4%
78    11.1%
79    10.5%
80    10.1%
81    9.8%
83    9.5%
84    8.7%
85    8.1%
86    8.0%
87    7.7%
88    7.5%
89    7.0%
90    7.0%
91    6.9%
92    6.8%
93    6.4%
94    6.3%
95    6.2%
96    5.8%
97    5.7%
100   5.2%
101   5.1%
102   5.1%
103   5.1%
104   5.0%
105   5.0%
106   4.9%
107   4.9%
108   4.8%
109   4.8%
110   4.7%
111   4.6%
112   4.6%
113   4.6%
114   4.6%
115   4.3%
116   4.3%
117   4.2%
118   4.1%
119   4.1%
120   4.0%
121   4.0%
122   4.0%
123   3.9%
124   3.9%
125   3.8%
126   3.8%
127   3.7%
128   3.6%
129   3.4%
130   3.3%
131   3.3%
132   3.1%

FAQ

What does AA-HLE measure?

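AA-HLE is the Humanity's Last Exam score published by Artificial Analysis: accuracy on expert-level questions aimed at frontier expert reasoning. BenchLM shows it as a display-only reference.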

Which model scores highest on AA-HLE?

Claude Opus 4.8 by Anthropic currently leads with a score of 45.7% on AA-HLE.

How many models are evaluated on AA-HLE?

BenchLM lists AA-HLE results for 132 AI models.

Last updated: July 4, 2026 · BenchLM version AA-HLE 2026
