An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.
Year
2025
Tasks
Expert-level questions
Format
Open-ended and multiple choice
Difficulty
Frontier expert level
HLE represents the hardest public benchmark available, with top models scoring only 10-45%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.
Humanity's Last Exam
GPT-5.4 by OpenAI currently leads HLE with a score of 46.
On BenchLM, 88 AI models have been evaluated on HLE.