
HLE (Humanity's Last Exam): The Hardest Benchmark

Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.

Glevd·March 7, 2026·10 min read

Humanity's Last Exam (HLE) is the hardest public AI benchmark available. While frontier models score 95-99% on most knowledge benchmarks, HLE scores range from the single digits to the mid-40s. It's the one benchmark where the gap between models is impossible to ignore.

In a landscape where MMLU and even GPQA are approaching saturation, HLE remains the clearest measure of where frontier AI actually stands — and where it falls short.

What makes HLE different

HLE was organized by the Center for AI Safety and Scale AI and crowdsourced from thousands of domain experts worldwide. The questions are designed to:

  • Test frontier-level knowledge — questions that even specialists find difficult
  • Cover cutting-edge domains — advanced mathematics, theoretical physics, philosophy, and other fields at the edge of human knowledge
  • Resist memorization — novel, expert-crafted questions not found in training data
  • Scale with AI progress — the benchmark was designed to remain challenging as models improve

This isn't a test of whether a model can recall facts. It's a test of whether a model can reason at the level of the world's top researchers.

How questions are sourced

HLE's question creation process is unprecedented in scale. Over 3,000 domain experts from top universities and research institutions contributed questions. Each question goes through multiple validation rounds:

  1. Expert creates a question in their area of specialization — often at the frontier of their field
  2. Other experts verify the answer is correct and the question is appropriately difficult
  3. Difficulty calibration ensures questions require genuine expertise, not just encyclopedia knowledge
  4. Format standardization converts questions into consistent multiple-choice or short-answer formats

The result is a benchmark that probes knowledge most humans — even highly educated ones — simply don't have. A question about recent advances in algebraic topology, the biochemistry of a newly discovered enzyme, or the implications of a 2025 physics paper is the kind of content HLE includes.
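
For readers who want to look at the questions themselves, the released portion of the benchmark can be pulled down with the Hugging Face datasets library. The snippet below is a minimal sketch: it assumes the public split is published under an identifier like cais/hle (check the Hub for the exact name and split), and it prints whatever fields each record happens to carry rather than assuming a particular schema.

```python
# Minimal sketch: inspect a few HLE records.
# Assumes the public split lives on the Hugging Face Hub under an
# identifier like "cais/hle" with a "test" split (verify on the Hub).
from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")

print(f"{len(dataset)} questions")
for record in dataset.select(range(3)):
    # Print whatever fields the record carries (question text, answer,
    # subject area, etc.) without assuming a fixed schema.
    for key, value in record.items():
        print(f"{key}: {str(value)[:120]}")
    print("-" * 40)
```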

Current scores

HLE has the widest spread of any benchmark we track:

Model           | HLE Score
GPT-5.4         | 46
GPT-5.3 Codex   | 44
GPT-5.2         | 40
Claude Opus 4.6 | 38
Gemini 3.1 Pro  | 35

Full leaderboard: HLE scores

The 11-point gap between GPT-5.4 (46) and Gemini 3.1 Pro (35) is massive. On MMLU, these models are within 2 points of each other. HLE reveals differences that other benchmarks can't see.

Why the scores are so low

Even the best model scores below 50%. This tells us something important: current AI models have genuine limitations in deep expert reasoning. They're excellent at processing known information but still struggle with questions that require true expert-level insight.

Several factors contribute to the low scores:

Knowledge recency

Many HLE questions reference findings published after a model's training cutoff. A question about a 2025 theorem proof or a recent experimental result can't be answered from training data alone — it requires genuine reasoning about unfamiliar material.

Depth vs. breadth

Models trained on internet-scale data have extraordinary breadth. They know something about almost everything. But HLE tests depth — the kind of expertise that takes years of focused study in a narrow field. Current models are wide but not always deep enough.

Multi-step expert reasoning

The hardest HLE questions require chaining multiple pieces of specialist knowledge together. A physics question might require combining quantum field theory with statistical mechanics in a way that even PhD students find challenging. Models often get the individual pieces right but fail to connect them.

Novel problem structures

Unlike benchmarks with well-known problem formats, HLE includes question types models haven't been trained to handle. This tests genuine adaptation rather than pattern matching.

HLE vs other knowledge benchmarks

To understand HLE's value, compare it with the other major knowledge benchmarks:

Benchmark  | Top Score | Score Spread | Saturation Risk | Best For
MMLU       | 93        | 70-93        | High            | General knowledge baseline
MMLU-Pro   | 87        | 50-87        | Medium          | Harder multiple choice
GPQA       | 97        | 80-97        | High            | PhD-level science (3 domains)
SuperGPQA  | 95        | 55-95        | Medium          | PhD-level (285 domains)
HLE        | 46        | 10-46        | Very low        | Frontier reasoning

HLE is the only benchmark where no model has crossed 50%. This means it will remain useful for differentiating models for years — possibly the entire next generation of AI development.
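
One way to make the saturation argument concrete is to compute, for each benchmark, how much headroom remains above the current top score and how widely the models are spread. The sketch below just re-derives those two numbers from the table above; the figures are the article's, not live data.

```python
# Headroom and spread per benchmark, using the figures from the table above.
# (top, low, high) = best current score, bottom and top of the score spread.
benchmarks = {
    "MMLU":      (93, 70, 93),
    "MMLU-Pro":  (87, 50, 87),
    "GPQA":      (97, 80, 97),
    "SuperGPQA": (95, 55, 95),
    "HLE":       (46, 10, 46),
}

for name, (top, low, high) in benchmarks.items():
    headroom = 100 - top   # points left before the benchmark saturates
    spread = high - low    # how well the benchmark separates models
    print(f"{name:<10} headroom {headroom:>2}  spread {spread:>2}")

# HLE is the only row with both large headroom (54 points) and a large
# spread (36 points), which is why it still differentiates frontier models.
```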

When to use HLE for evaluation

HLE is most useful when:

  • Comparing frontier models — it's the best benchmark for seeing real differences between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro
  • Tracking AI progress over time — scores are far from saturation, so improvements will be visible for years
  • Assessing deep reasoning — if your use case requires PhD-level scientific knowledge, HLE scores are the best predictor
  • Evaluating reasoning models — models with chain-of-thought capabilities (o3, o4-mini, DeepSeek R1) tend to show larger improvements on HLE than non-reasoning models

HLE is less useful for:

  • Evaluating mid-tier models — most non-frontier models score in the single digits, so there's not enough variance to differentiate them
  • Practical task prediction — a score of 38 vs 46 on frontier expert questions may not predict performance on typical business or consumer tasks
  • Domain-specific evaluation — HLE covers many domains, so you can't isolate performance in any specific field. Use GPQA for science-specific evaluation.

What HLE scores mean for AI capabilities

The fact that the best AI scores below 50% on HLE is often cited as evidence that AI has "a long way to go." But context matters:

  • Human experts score about 74% on HLE questions in their own domain of expertise. On questions outside their specialty, experts score much lower — sometimes below the models.
  • The benchmark is designed to be hard for everyone, including humans. A model scoring 46% is performing at roughly the level of a smart generalist researcher across all domains simultaneously.
  • Progress has been rapid: early 2025 frontier models scored around 10-15% on HLE. By early 2026, scores had roughly tripled to the mid-40s. If this trajectory continues, models could cross 70% by late 2026 or early 2027 (see the back-of-the-envelope extrapolation below).
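
To see where the "70% by late 2026 or early 2027" figure comes from, here is a straight-line extrapolation from the two data points cited above (roughly 12% in early 2025 and 46% in early 2026). It is purely illustrative: benchmark progress rarely stays linear, and the input numbers are approximations.

```python
# Back-of-the-envelope extrapolation of top HLE scores.
# Two approximate data points from the article: ~12% in early 2025, ~46% in early 2026.
points = [(2025.0, 12.0), (2026.0, 46.0)]

(t0, s0), (t1, s1) = points
slope = (s1 - s0) / (t1 - t0)  # ~34 points of improvement per year

def projected_score(year: float) -> float:
    """Linear projection from the latest data point, capped at 100 (illustrative only)."""
    return min(100.0, s1 + slope * (year - t1))

for year in (2026.5, 2026.75, 2027.0):
    print(f"{year:.2f}: ~{projected_score(year):.0f}%")
# On this naive straight line, 70% is crossed around 2026.7,
# i.e. late 2026, consistent with the projection in the text.
```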

The bottom line

HLE is the single most important benchmark for tracking AI progress at the frontier. While other benchmarks have become checkboxes (score 90+ and move on), HLE still has room to differentiate. With every new frontier model release, the first question is: "What did it score on HLE?"

For broader knowledge assessment, pair HLE with MMLU-Pro and GPQA. For a complete picture including coding and math, see our overall model rankings or use the LLM Selector Quiz to find the best model for your specific needs.


Data from BenchLM.ai. Last updated March 2026.
