Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Model scores range from just 10% to 46%. Here's why HLE matters.
Humanity's Last Exam (HLE) is the hardest public AI benchmark available. While frontier models score 90% or more on most knowledge benchmarks, HLE scores range from the single digits to the mid-40s. It's the one benchmark where the gap between models is impossible to ignore.
In a landscape where MMLU and even GPQA are approaching saturation, HLE remains the clearest measure of where frontier AI actually stands — and where it falls short.
HLE was crowdsourced from thousands of domain experts worldwide, organized by the Center for AI Safety and Scale AI. The questions are designed to probe the absolute frontier of expert knowledge and reasoning, beyond what can be answered from memorized training data.
This isn't a test of whether a model can recall facts. It's a test of whether a model can reason at the level of the world's top researchers.
HLE's question creation process is unprecedented in scale. Over 3,000 domain experts from top universities and research institutions contributed questions, and each question goes through multiple validation rounds before it's accepted into the benchmark.
The result is a benchmark that probes knowledge most humans — even highly educated ones — simply don't have: recent advances in algebraic topology, the biochemistry of a newly discovered enzyme, the implications of a 2025 physics paper.
HLE has the widest spread of any benchmark we track:
| Model | HLE Score (%) |
|---|---|
| GPT-5.4 | 46 |
| GPT-5.3 Codex | 44 |
| GPT-5.2 | 40 |
| Claude Opus 4.6 | 38 |
| Gemini 3.1 Pro | 35 |
Full leaderboard: HLE scores
The 11-point gap between GPT-5.4 (46) and Gemini 3.1 Pro (35) is massive. On MMLU, these models are within 2 points of each other. HLE reveals differences that other benchmarks can't see.
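To make that spread concrete, here's a minimal Python sketch. The scores are copied from the leaderboard table above; the snippet itself is purely illustrative, not a BenchLM.ai API:

```python
# HLE scores from the leaderboard table above (percent correct)
hle_scores = {
    "GPT-5.4": 46,
    "GPT-5.3 Codex": 44,
    "GPT-5.2": 40,
    "Claude Opus 4.6": 38,
    "Gemini 3.1 Pro": 35,
}

best = max(hle_scores, key=hle_scores.get)   # GPT-5.4
worst = min(hle_scores, key=hle_scores.get)  # Gemini 3.1 Pro
gap = hle_scores[best] - hle_scores[worst]   # 11 points

print(f"{best} leads {worst} by {gap} points on HLE")
# On MMLU the same models sit within ~2 points of each other,
# so HLE's frontier spread is roughly five times wider.
```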
Even the best model scores below 50%. This tells us something important: current AI models have genuine limitations in deep expert reasoning. They're excellent at processing known information but still struggle with questions that require true expert-level insight.
Several factors contribute to the low scores:
Many HLE questions reference findings published after a model's training cutoff. A question about a 2025 theorem proof or a recent experimental result can't be answered from training data alone — it requires genuine reasoning about unfamiliar material.
Models trained on internet-scale data have extraordinary breadth. They know something about almost everything. But HLE tests depth — the kind of expertise that takes years of focused study in a narrow field. Current models are wide but not always deep enough.
The hardest HLE questions require chaining multiple pieces of specialist knowledge together. A physics question might require combining quantum field theory with statistical mechanics in a way that even PhD students find challenging. Models often get the individual pieces right but fail to connect them.
Unlike benchmarks with well-known problem formats, HLE includes question types models haven't been trained to handle. This tests genuine adaptation rather than pattern matching.
To understand HLE's value, compare it with the other major knowledge benchmarks:
| Benchmark | Top Score (%) | Score Spread (%) | Saturation Risk | Best For |
|---|---|---|---|---|
| MMLU | 93 | 70-93 | High | General knowledge baseline |
| MMLU-Pro | 87 | 50-87 | Medium | Harder multiple choice |
| GPQA | 97 | 80-97 | High | PhD-level science (3 domains) |
| SuperGPQA | 95 | 55-95 | Medium | PhD-level (285 domains) |
| HLE | 46 | 10-46 | Very low | Frontier reasoning |
HLE is the only benchmark where no model has crossed 50%. This means it will remain useful for differentiating models for years — possibly the entire next generation of AI development.
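One way to see this is to compare each benchmark's remaining headroom, i.e. the distance between its top score and 100. A rough sketch using the numbers from the comparison table above ("headroom" is our informal shorthand here, not a standard metric):

```python
# Top scores and spread ranges from the comparison table (percent)
benchmarks = {
    # name: (top_score, low_end_of_spread)
    "MMLU":      (93, 70),
    "MMLU-Pro":  (87, 50),
    "GPQA":      (97, 80),
    "SuperGPQA": (95, 55),
    "HLE":       (46, 10),
}

for name, (top, low) in benchmarks.items():
    headroom = 100 - top  # points left before the benchmark saturates
    spread = top - low    # gap between the best and worst tracked models
    print(f"{name:10} headroom={headroom:>2}  spread={spread:>2}")

# HLE has 54 points of headroom; no other benchmark here exceeds 13.
```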
HLE is most useful when you need to separate frontier models that cluster within a point or two on saturated benchmarks, or when you want to track raw progress at the frontier from one model generation to the next.
HLE is less useful for comparing non-frontier models, whose scores bunch near the bottom of the range, or for predicting performance on everyday tasks, where broader benchmarks tell you more.
The fact that the best AI scores below 50% on HLE is often cited as evidence that AI has "a long way to go." But context matters: no single human expert could answer more than a small fraction of HLE's questions, because each one demands years of focused study in a different narrow field. A 46% score already reflects a breadth of expert knowledge that no individual possesses.
HLE is the single most important benchmark for tracking AI progress at the frontier. While other benchmarks have become checkboxes (score 90+ and move on), HLE still has room to differentiate. With every new frontier model release, the first question is: "What did it score on HLE?"
For broader knowledge assessment, pair HLE with MMLU-Pro and GPQA. For a complete picture including coding and math, see our overall model rankings or use the LLM Selector Quiz to find the best model for your specific needs.
Data from BenchLM.ai. Last updated March 2026.