
Humanity's Last Exam (HLE)

An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

Top models on HLE — April 21, 2026

As of April 21, 2026, Claude Mythos Preview leads the HLE leaderboard with 64.7%, followed by GPT-5.4 Pro (58.7%) and Claude Opus 4.7 (54.7%).

20 models · Knowledge category · 23% of category score · Current · Updated April 21, 2026

According to BenchLM.ai, Claude Mythos Preview leads the HLE benchmark with a score of 64.7%, followed by GPT-5.4 Pro (58.7%) and Claude Opus 4.7 (54.7%). Scores across the 20 evaluated models span 17.2% to 64.7%, a wide spread that makes this benchmark effective at differentiating model capabilities.

20 models have been evaluated on HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, HLE contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
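Putting the two stated weights together gives HLE's effective share of a model's overall BenchLM.ai score. A minimal sketch of that arithmetic, assuming the weights simply multiply (the exact aggregation is defined on the methodology page):

```python
# Hypothetical sketch: combine the two weights stated above.
# Knowledge category = 12% of the overall score; HLE = 23% of that category.
category_weight = 0.12   # Knowledge category's share of the overall score
benchmark_share = 0.23   # HLE's share within the Knowledge category

# Assuming weights multiply, a model's HLE score moves its
# overall score with this effective weight:
effective_weight = category_weight * benchmark_share
print(f"{effective_weight:.2%}")  # 2.76%
```

So under this assumption, HLE alone accounts for roughly 2.8% of a model's overall ranking score.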

About HLE

Year: 2025
Tasks: Expert-level questions
Format: Open-ended and multiple choice
Difficulty: Frontier expert level

HLE represents the hardest public benchmark available: even the top model scores only 64.7%, and most models on the leaderboard score below 50%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.

BenchLM freshness & provenance

Version: Humanity's Last Exam
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
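The decision rule described above can be sketched as a simple mapping from freshness metadata to a benchmark's role. This is an illustrative assumption only: the tier names come from the paragraph above, but the field values and logic are hypothetical, not BenchLM's actual policy.

```python
# Hypothetical sketch of a freshness-to-role mapping.
# Tier names are from the text; the states and logic are assumptions.
def benchmark_role(staleness_state: str) -> str:
    """Map a benchmark's staleness state to how it is treated in scoring."""
    if staleness_state == "Current":
        return "strong differentiator"
    if staleness_state == "Aging":
        return "benchmark to watch"
    return "display-only reference"

# HLE's staleness state is listed as "Current" above:
print(benchmark_role("Current"))  # strong differentiator
```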

Leaderboard (20 models)

1. 64.7%
2. 58.7%
3. 54.7%
4. 53.0%
5. 52.3%
6. 50.4%
7. 50.4%
8. 49.0%
9. 41.5%
10. 37.7%
11. 34.7%
12. 30.8%
13. 30.1%
14. 28.8%
15. 28.7%
16. 26.5%
17. 24.8%
18. 21.4%
19. 18.8%
20. 17.2%

FAQ

What does HLE measure?

HLE measures frontier-level knowledge and reasoning. It is an extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

Which model scores highest on HLE?

Claude Mythos Preview by Anthropic currently leads with a score of 64.7% on HLE.

How many models are evaluated on HLE?

20 AI models have been evaluated on HLE on BenchLM.

Last updated: April 21, 2026 · Benchmark version: Humanity's Last Exam

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.