
Humanity's Last Exam (HLE)

An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

Top models on HLE — April 21, 2026

As of April 21, 2026, Claude Mythos Preview leads the HLE leaderboard with 64.7%, followed by GPT-5.4 Pro (58.7%) and Claude Opus 4.7 (54.7%).

20 models · Knowledge category · 23% of category score · Current · Updated April 21, 2026

According to BenchLM.ai, Claude Mythos Preview leads the HLE benchmark with a score of 64.7%, followed by GPT-5.4 Pro (58.7%) and Claude Opus 4.7 (54.7%). Scores across the 20 evaluated models span 17.2% to 64.7%, a wide spread that makes this benchmark effective at differentiating model capabilities.

20 models have been evaluated on HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, HLE contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
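Putting the two stated weights together gives HLE's effective share of a model's overall BenchLM.ai score. A minimal sketch of that arithmetic, assuming the weights simply multiply (the exact aggregation is defined on the methodology page):

```python
# Hypothetical sketch: combine the two weights stated above.
# Knowledge category = 12% of the overall score; HLE = 23% of that category.
category_weight = 0.12   # Knowledge category's share of the overall score
benchmark_share = 0.23   # HLE's share within the Knowledge category

# Assuming weights multiply, a model's HLE score moves its
# overall score with this effective weight:
effective_weight = category_weight * benchmark_share
print(f"{effective_weight:.2%}")  # 2.76%
```

So under this assumption, HLE alone accounts for roughly 2.8% of a model's overall ranking score.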

About HLE

Year: 2025
Tasks: Expert-level questions
Format: Open-ended and multiple choice
Difficulty: Frontier expert level

HLE represents the hardest public benchmark available: even the top model scores only 64.7%, and most models on the leaderboard score below 50%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.

BenchLM freshness & provenance

Version: Humanity's Last Exam
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
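The decision rule described above can be sketched as a simple mapping from freshness metadata to a benchmark's role. This is an illustrative assumption only: the tier names come from the paragraph above, but the field values and logic are hypothetical, not BenchLM's actual policy.

```python
# Hypothetical sketch of a freshness-to-role mapping.
# Tier names are from the text; the states and logic are assumptions.
def benchmark_role(staleness_state: str) -> str:
    """Map a benchmark's staleness state to how it is treated in scoring."""
    if staleness_state == "Current":
        return "strong differentiator"
    if staleness_state == "Aging":
        return "benchmark to watch"
    return "display-only reference"

# HLE's staleness state is listed as "Current" above:
print(benchmark_role("Current"))  # strong differentiator
```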

Leaderboard (20 models)

1. 64.7%
2. 58.7%
3. 54.7%
4. 53.0%
5. 52.3%
6. 50.4%
7. 50.4%
8. 49.0%
9. 41.5%
10. 37.7%
11. 34.7%
12. 30.8%
13. 30.1%
14. 28.8%
15. 28.7%
16. 26.5%
17. 24.8%
18. 21.4%
19. 18.8%
20. 17.2%

FAQ

What does HLE measure?

HLE measures frontier-level knowledge and reasoning. It is an extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

Which model scores highest on HLE?

Claude Mythos Preview by Anthropic currently leads with a score of 64.7% on HLE.

How many models are evaluated on HLE?

20 AI models have been evaluated on HLE on BenchLM.

Last updated: April 21, 2026 · Benchmark version: Humanity's Last Exam

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.