An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.
As of April 21, 2026, Claude Mythos Preview leads the HLE leaderboard with 64.7%, followed by GPT-5.4 Pro (58.7%) and Claude Opus 4.7 (54.7%).
Claude Mythos Preview (Anthropic): 64.7%
GPT-5.4 Pro (OpenAI): 58.7%
Claude Opus 4.7 (Anthropic): 54.7%
According to BenchLM.ai, the top three models alone are separated by 10 percentage points (64.7% to 54.7%). Scores are spread widely across the rest of the leaderboard as well, which makes this benchmark effective at differentiating model capabilities.
20 models have been evaluated on HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, HLE contributes 23% of the category score, so strong performance here directly affects a model's overall ranking.
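The two weights above can be combined into a single effective weight for HLE. The sketch below is only an illustration: it assumes the overall score is a plain linear weighted sum of benchmark scores, which the text does not confirm, and the function names `effective_weight` and `overall_contribution` are hypothetical, not part of any BenchLM.ai API.

```python
# Weights quoted in the text above.
CATEGORY_WEIGHT = 0.12  # Knowledge category's share of the overall score
HLE_SHARE = 0.23        # HLE's share of the Knowledge category score


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to one benchmark."""
    return category_weight * benchmark_share


def overall_contribution(hle_score: float) -> float:
    """Overall-score points contributed by an HLE score.

    ASSUMPTION: the overall score is a linear weighted sum of benchmark
    scores. The actual BenchLM.ai formula may normalize or cap scores.
    """
    return effective_weight(CATEGORY_WEIGHT, HLE_SHARE) * hle_score


if __name__ == "__main__":
    weight = effective_weight(CATEGORY_WEIGHT, HLE_SHARE)
    print(f"Effective HLE weight: {weight:.2%}")
    # Gap between first place (64.7%) and third place (54.7%) on HLE.
    gap = overall_contribution(64.7) - overall_contribution(54.7)
    print(f"Overall-score gap from a 10-point HLE lead: {gap:.3f}")
```

Under that assumption, HLE accounts for about 2.76% of the overall score, so a 10-point lead on HLE moves the overall score by roughly 0.28 points.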
Year: 2025
Tasks: Expert-level questions
Format: Open-ended and multiple choice
Difficulty: Frontier expert level
HLE is among the hardest public benchmarks available: even the current leader scores only 64.7%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.
Version: Humanity's Last Exam
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.