
HLE (Humanity's Last Exam): The Hardest Benchmark

Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.

Glevd · Published March 7, 2026 · 10 min read


HLE is the hardest public AI benchmark available. Frontier models score 95-99% on most knowledge tests — on HLE, the best score is 46%. The 11-point gap between first and fifth place reveals performance differences that every other knowledge benchmark masks. If you want to know where frontier AI actually stands, HLE is the only benchmark that still has room to tell you.

Humanity's Last Exam (HLE) earns that title the hard way. While frontier models score 95-99% on most knowledge benchmarks, HLE scores range from the single digits to the mid-40s. It's the one benchmark where the gap between models is impossible to ignore.

In a landscape where MMLU and even GPQA are approaching saturation, HLE remains the clearest measure of where frontier AI actually stands — and where it falls short.

What makes HLE different

HLE was crowdsourced from thousands of domain experts worldwide, organized by the Center for AI Safety and Scale AI. The questions are designed to:

  • Test frontier-level knowledge — questions that even specialists find difficult
  • Cover cutting-edge domains — advanced mathematics, theoretical physics, philosophy, and other fields at the edge of human knowledge
  • Resist memorization — novel, expert-crafted questions not found in training data
  • Scale with AI progress — the benchmark was designed to remain challenging as models improve

This isn't a test of whether a model can recall facts. It's a test of whether a model can reason at the level of the world's top researchers.

How questions are sourced

HLE's question creation process is unprecedented in scale. Over 3,000 domain experts from top universities and research institutions contributed questions. Each question goes through multiple validation rounds:

  1. Expert creates a question in their area of specialization — often at the frontier of their field
  2. Other experts verify the answer is correct and the question is appropriately difficult
  3. Difficulty calibration ensures questions require genuine expertise, not just encyclopedia knowledge
  4. Format standardization converts questions into consistent multiple-choice or short-answer formats

The result is a benchmark that probes knowledge most humans — even highly educated ones — simply don't have.
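To make the review pipeline concrete, here is a minimal Python sketch of how such a multi-round process might be modeled. The Question record, the field names, and the two-reviewer threshold are illustrative assumptions, not HLE's actual tooling or schema.

```python
from dataclasses import dataclass, field

# Illustrative data model for a crowdsourced question moving through review.
# Field names and the two-reviewer threshold are assumptions, not HLE's real schema.
@dataclass
class Question:
    author: str
    domain: str
    prompt: str
    answer: str
    fmt: str = "short_answer"                          # or "multiple_choice"
    peer_reviews: list = field(default_factory=list)   # (reviewer_id, approved) pairs

    def approved(self, required: int = 2) -> bool:
        """A question survives only if enough independent experts sign off."""
        return sum(1 for _, ok in self.peer_reviews if ok) >= required


q = Question(
    author="expert_042",
    domain="theoretical physics",
    prompt="(frontier-level question text)",
    answer="(expert-verified answer)",
)
q.peer_reviews += [("expert_101", True), ("expert_187", True)]
print(q.approved())  # True -> moves on to difficulty calibration and format standardization
```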

Current scores

HLE has the widest spread of any benchmark we track:

Model              HLE Score
GPT-5.4            46
GPT-5.3 Codex      44
GPT-5.2            40
Claude Opus 4.6    38
Gemini 3.1 Pro     35

Full leaderboard: HLE scores

The 11-point gap between GPT-5.4 (46) and Gemini 3.1 Pro (35) is massive. On MMLU, these models are within 2 points of each other. HLE reveals differences that other benchmarks can't see.
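As a quick illustration, the spread can be computed directly from the table above. The numbers are the ones in this article's snapshot; the snippet is plain arithmetic, not an official BenchLM.ai API.

```python
# HLE scores from the table above (March 2026 snapshot from BenchLM.ai).
hle_scores = {
    "GPT-5.4": 46,
    "GPT-5.3 Codex": 44,
    "GPT-5.2": 40,
    "Claude Opus 4.6": 38,
    "Gemini 3.1 Pro": 35,
}

ranked = sorted(hle_scores.items(), key=lambda kv: kv[1], reverse=True)
(best_name, best), (worst_name, worst) = ranked[0], ranked[-1]
print(f"{best_name} to {worst_name}: {best - worst} point spread")
# -> GPT-5.4 to Gemini 3.1 Pro: 11 point spread
```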

Why the scores are so low

Even the best model scores below 50%. This tells us something important: current AI models have genuine limitations in deep expert reasoning. They're excellent at processing known information but still struggle with questions that require true expert-level insight.

Several factors contribute to the low scores:

Knowledge recency

Many HLE questions reference findings published after a model's training cutoff. A question about a 2025 theorem proof or a recent experimental result can't be answered from training data alone — it requires genuine reasoning about unfamiliar material.

Depth vs. breadth

Models trained on internet-scale data have extraordinary breadth. They know something about almost everything. But HLE tests depth — the kind of expertise that takes years of focused study in a narrow field. Current models are wide but not always deep enough.

Multi-step expert reasoning

The hardest HLE questions require chaining multiple pieces of specialist knowledge together. A physics question might require combining quantum field theory with statistical mechanics in a way that even PhD students find challenging. Models often get the individual pieces right but fail to connect them.

HLE vs other knowledge benchmarks

Benchmark    Top Score   Score Spread   Saturation Risk   Best For
MMLU         93          70-93          High              General knowledge baseline
MMLU-Pro     87          50-87          Medium            Harder multiple choice
GPQA         97          80-97          High              PhD-level science (3 domains)
SuperGPQA    95          55-95          Medium            PhD-level (285 domains)
HLE          46          10-46          Very low          Frontier reasoning

HLE is the only benchmark where no model has crossed 50%. This means it will remain useful for differentiating models for years.
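One rough way to quantify this is headroom, the distance between a benchmark's top score and 100. A small sketch using the table values above (the metric itself is our framing, not a BenchLM.ai statistic):

```python
# Top scores from the comparison table above.
top_scores = {"MMLU": 93, "MMLU-Pro": 87, "GPQA": 97, "SuperGPQA": 95, "HLE": 46}

for name, top in sorted(top_scores.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} headroom: {100 - top:2d} points")
# HLE has 54 points of headroom; the other benchmarks have between 3 and 13.
```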

When to use HLE for evaluation

HLE is most useful when:

  • Comparing frontier models — it's the best benchmark for seeing real differences between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro
  • Tracking AI progress over time — scores are far from saturation, so improvements will be visible for years
  • Assessing deep reasoning — if your use case requires PhD-level scientific knowledge, HLE scores are the best predictor
  • Evaluating reasoning models — models with chain-of-thought capabilities tend to show larger improvements on HLE than non-reasoning models

HLE is less useful for evaluating mid-tier models (most score in single digits) or for predicting performance on typical business tasks.
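For teams that want to run this kind of evaluation themselves, here is a minimal sketch of an HLE-style grading loop. It assumes exact-match scoring on short answers and uses a hypothetical query_model stand-in; the official HLE harness and grading procedure may differ.

```python
# Minimal HLE-style grading loop (illustrative only).
# `query_model` is a hypothetical stand-in for whatever model client you use;
# the official HLE harness may grade short answers more flexibly than exact match.
def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model API client here")

def evaluate(questions: list[dict]) -> float:
    """Return accuracy over a list of {'prompt': ..., 'answer': ...} items."""
    correct = 0
    for q in questions:
        prediction = query_model(q["prompt"]).strip().lower()
        if prediction == q["answer"].strip().lower():
            correct += 1
    return correct / len(questions) if questions else 0.0

# A model scoring 0.46 here would match the current HLE leader's 46%.
```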

What HLE scores mean for AI capabilities

  • Human experts score about 74% on HLE questions in their own domain of expertise. On questions outside their specialty, experts score much lower — sometimes below the models.
  • Progress has been rapid: in early 2025, frontier models scored around 10-15% on HLE. By early 2026, the top score has roughly tripled to the mid-40s.
  • The benchmark was designed to remain relevant as AI improves, and it's delivering on that promise.

The bottom line

HLE is the single most important benchmark for tracking AI progress at the frontier. While other benchmarks have become checkboxes, HLE still has room to differentiate.

See all models on the full leaderboard · Knowledge rankings


Frequently asked questions

What is HLE (Humanity's Last Exam)? HLE is an AI benchmark crowdsourced from 3,000+ domain experts at top universities. It tests frontier-level knowledge across 100+ fields — advanced math, theoretical physics, philosophy, and more. No model has crossed 50%: the top score as of March 2026 is 46 (GPT-5.4).

What score does GPT-5.4 get on HLE? GPT-5.4 scores 46 on HLE, the highest tracked by BenchLM.ai. GPT-5.3 Codex: 44, GPT-5.2: 40, Claude Opus 4.6: 38, Gemini 3.1 Pro: 35. See the HLE leaderboard for current rankings.

Why do top AI models score so low on HLE? HLE questions reference recent research published after models' training cutoffs, require chaining specialist knowledge across fields, and use novel problem structures. Even human domain experts score only about 74% on questions in their own specialty. AI models have broad knowledge but not always the depth these questions demand.

How is HLE different from GPQA and MMLU? On MMLU, frontier models score 97-99% (saturated). On GPQA, top scores reach 95-97% (approaching saturation). On HLE, scores range from 10% to 46% (plenty of room). HLE provides far more differentiation between frontier models than either alternative.

Is HLE a good benchmark for comparing AI models? Yes — for frontier models, it is the best public benchmark. The 11-point gap between GPT-5.4 and Gemini 3.1 Pro would be invisible on MMLU. Less useful for mid-tier models, which mostly score in single digits.


Data from BenchLM.ai. Last updated March 2026.
