Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.
HLE is the hardest public AI benchmark available. Frontier models score 95-99% on most knowledge tests — on HLE, the best score is 46%. The 11-point gap between first and fifth place reveals performance differences that every other knowledge benchmark masks. If you want to know where frontier AI actually stands, HLE is the only benchmark that still has room to tell you.
Humanity's Last Exam (HLE) lives up to that billing. While frontier models score 95-99% on most knowledge benchmarks, HLE scores range from the single digits to the mid-40s. It's the one benchmark where the gap between models is impossible to ignore.
In a landscape where MMLU and even GPQA are approaching saturation, HLE remains the clearest measure of where frontier AI actually stands — and where it falls short.
HLE was crowdsourced from thousands of domain experts worldwide, organized by the Center for AI Safety and Scale AI. The questions are designed to probe the absolute frontier of expert knowledge and reasoning.
This isn't a test of whether a model can recall facts. It's a test of whether a model can reason at the level of the world's top researchers.
HLE's question creation process is unprecedented in scale. Over 3,000 domain experts from top universities and research institutions contributed questions, and each question goes through multiple validation rounds before it is accepted into the benchmark.
The result is a benchmark that probes knowledge most humans — even highly educated ones — simply don't have.
HLE has the widest spread of any benchmark we track:
| Model | HLE Score |
|---|---|
| GPT-5.4 | 46 |
| GPT-5.3 Codex | 44 |
| GPT-5.2 | 40 |
| Claude Opus 4.6 | 38 |
| Gemini 3.1 Pro | 35 |
Full leaderboard: HLE scores
The 11-point gap between GPT-5.4 (46) and Gemini 3.1 Pro (35) is massive. On MMLU, these models are within 2 points of each other. HLE reveals differences that other benchmarks can't see.
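To make that concrete, here is a minimal Python sketch (our own illustration, not BenchLM.ai tooling) that computes the first-to-fifth spread from the leaderboard scores above and expresses it as a fraction of the top score. The `spread` helper is a hypothetical name of ours; the only inputs are the HLE scores in the table and the ~2-point MMLU gap quoted in this section.

```python
# Illustrative sketch: HLE scores copied from the leaderboard table above.
hle_scores = {
    "GPT-5.4": 46,
    "GPT-5.3 Codex": 44,
    "GPT-5.2": 40,
    "Claude Opus 4.6": 38,
    "Gemini 3.1 Pro": 35,
}

def spread(scores):
    """Return the first-to-last gap and that gap as a fraction of the top score."""
    top, bottom = max(scores.values()), min(scores.values())
    return top - bottom, (top - bottom) / top

gap, relative = spread(hle_scores)
print(f"HLE gap: {gap} points ({relative:.0%} of the top score)")
# -> HLE gap: 11 points (24% of the top score)
# For comparison, a ~2-point gap on a saturated benchmark like MMLU
# (top score ~93 in the comparison table below) is only ~2% of the top score.
```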
Even the best model scores below 50%. This tells us something important: current AI models have genuine limitations in deep expert reasoning. They're excellent at processing known information but still struggle with questions that require true expert-level insight.
Several factors contribute to the low scores:
Many HLE questions reference findings published after a model's training cutoff. A question about a 2025 theorem proof or a recent experimental result can't be answered from training data alone — it requires genuine reasoning about unfamiliar material.
Models trained on internet-scale data have extraordinary breadth. They know something about almost everything. But HLE tests depth — the kind of expertise that takes years of focused study in a narrow field. Current models are wide but not always deep enough.
The hardest HLE questions require chaining multiple pieces of specialist knowledge together. A physics question might require combining quantum field theory with statistical mechanics in a way that even PhD students find challenging. Models often get the individual pieces right but fail to connect them.
| Benchmark | Top Score | Score Spread | Saturation Risk | Best For |
|---|---|---|---|---|
| MMLU | 93 | 70-93 | High | General knowledge baseline |
| MMLU-Pro | 87 | 50-87 | Medium | Harder multiple choice |
| GPQA | 97 | 80-97 | High | PhD-level science (3 domains) |
| SuperGPQA | 95 | 55-95 | Medium | PhD-level (285 domains) |
| HLE | 46 | 10-46 | Very low | Frontier reasoning |
HLE is the only benchmark where no model has crossed 50%. This means it will remain useful for differentiating models for years.
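The same point can be made numerically. The sketch below (illustrative Python, using only the top scores and score ranges from the comparison table above) computes each benchmark's remaining headroom and its tracked score spread; larger values on both mean more room to differentiate models.

```python
# Illustrative sketch: figures copied from the comparison table above.
benchmarks = {
    "MMLU":      {"top": 93, "low": 70},
    "MMLU-Pro":  {"top": 87, "low": 50},
    "GPQA":      {"top": 97, "low": 80},
    "SuperGPQA": {"top": 95, "low": 55},
    "HLE":       {"top": 46, "low": 10},
}

for name, scores in benchmarks.items():
    headroom = 100 - scores["top"]          # points left before the benchmark saturates
    spread = scores["top"] - scores["low"]  # gap between best and worst tracked model
    print(f"{name:<10} headroom: {headroom:>2}  spread: {spread:>2}")
# HLE is the only row with 50+ points of headroom, which is why it still
# separates frontier models that look identical on saturated benchmarks.
```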
HLE is most useful when you need to separate frontier models that look nearly identical on saturated benchmarks like MMLU and GPQA. It is less useful for evaluating mid-tier models (most score in single digits) or for predicting performance on typical business tasks.
HLE is the single most important benchmark for tracking AI progress at the frontier. While other benchmarks have become checkboxes, HLE still has room to differentiate.
→ See all models on the full leaderboard · Knowledge rankings
What is HLE (Humanity's Last Exam)? HLE is an AI benchmark crowdsourced from 3,000+ domain experts at top universities. It tests frontier-level knowledge across 100+ fields — advanced math, theoretical physics, philosophy, and more. No model has crossed 50%: the top score as of March 2026 is 46 (GPT-5.4).
What score does GPT-5.4 get on HLE? GPT-5.4 scores 46 on HLE, the highest tracked by BenchLM.ai. GPT-5.3 Codex: 44, GPT-5.2: 40, Claude Opus 4.6: 38, Gemini 3.1 Pro: 35. See the HLE leaderboard for current rankings.
Why do top AI models score so low on HLE? HLE questions often reference research published after a model's training cutoff, require chaining specialist knowledge across fields, and use novel problem structures. Even human domain experts score ~74% on questions in their specialty. AI models have broad knowledge but not always the depth these questions demand.
How is HLE different from GPQA and MMLU? On MMLU, frontier models score in the 90s (saturated). On GPQA, top scores reach the mid-to-high 90s (approaching saturation). On HLE, scores range from 10 to 46 (plenty of room). HLE provides far more differentiation between frontier models than either alternative.
Is HLE a good benchmark for comparing AI models? Yes — for frontier models, it is the best public benchmark. The 11-point gap between GPT-5.4 and Gemini 3.1 Pro would be invisible on MMLU. Less useful for mid-tier models, which mostly score in single digits.
Data from BenchLM.ai. Last updated March 2026.