
HLE (Humanity's Last Exam): The Hardest Benchmark

Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.

Glevd·March 7, 2026·10 min read

Humanity's Last Exam (HLE) is the hardest public AI benchmark available. While frontier models score 95-99% on most knowledge benchmarks, HLE scores range from the single digits to the mid-40s. It's the one benchmark where the gap between models is impossible to ignore.

In a landscape where MMLU and even GPQA are approaching saturation, HLE remains the clearest measure of where frontier AI actually stands — and where it falls short.

What makes HLE different

HLE was organized by the Center for AI Safety and Scale AI and crowdsourced from thousands of domain experts worldwide. The questions are designed to:

  • Test frontier-level knowledge — questions that even specialists find difficult
  • Cover cutting-edge domains — advanced mathematics, theoretical physics, philosophy, and other fields at the edge of human knowledge
  • Resist memorization — novel, expert-crafted questions not found in training data
  • Scale with AI progress — the benchmark was designed to remain challenging as models improve

This isn't a test of whether a model can recall facts. It's a test of whether a model can reason at the level of the world's top researchers.

How questions are sourced

HLE's question creation process is unprecedented in scale. Over 3,000 domain experts from top universities and research institutions contributed questions. Each question goes through multiple validation rounds:

  1. Expert creates a question in their area of specialization — often at the frontier of their field
  2. Other experts verify the answer is correct and the question is appropriately difficult
  3. Difficulty calibration ensures questions require genuine expertise, not just encyclopedia knowledge
  4. Format standardization converts questions into consistent multiple-choice or short-answer formats

The result is a benchmark that probes knowledge most humans — even highly educated ones — simply don't have. A question about recent advances in algebraic topology, the biochemistry of a newly discovered enzyme, or the implications of a 2025 physics paper is the kind of content HLE includes.
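
For readers who want to look at the questions themselves, the released portion of the benchmark can be pulled down with the Hugging Face datasets library. The snippet below is a minimal sketch: it assumes the public split is published under an identifier like cais/hle (check the Hub for the exact name and split), and it prints whatever fields each record happens to carry rather than assuming a particular schema.

```python
# Minimal sketch: inspect a few HLE records.
# Assumes the public split lives on the Hugging Face Hub under an
# identifier like "cais/hle" with a "test" split (verify on the Hub).
from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")

print(f"{len(dataset)} questions")
for record in dataset.select(range(3)):
    # Print whatever fields the record carries (question text, answer,
    # subject area, etc.) without assuming a fixed schema.
    for key, value in record.items():
        print(f"{key}: {str(value)[:120]}")
    print("-" * 40)
```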

Current scores

HLE has the widest spread of any benchmark we track:

Model           | HLE Score
GPT-5.4         | 46
GPT-5.3 Codex   | 44
GPT-5.2         | 40
Claude Opus 4.6 | 38
Gemini 3.1 Pro  | 35

Full leaderboard: HLE scores

The 11-point gap between GPT-5.4 (46) and Gemini 3.1 Pro (35) is massive. On MMLU, these models are within 2 points of each other. HLE reveals differences that other benchmarks can't see.

Why the scores are so low

Even the best model scores below 50%. This tells us something important: current AI models have genuine limitations in deep expert reasoning. They're excellent at processing known information but still struggle with questions that require true expert-level insight.

Several factors contribute to the low scores:

Knowledge recency

Many HLE questions reference findings published after a model's training cutoff. A question about a 2025 theorem proof or a recent experimental result can't be answered from training data alone — it requires genuine reasoning about unfamiliar material.

Depth vs. breadth

Models trained on internet-scale data have extraordinary breadth. They know something about almost everything. But HLE tests depth — the kind of expertise that takes years of focused study in a narrow field. Current models are wide but not always deep enough.

Multi-step expert reasoning

The hardest HLE questions require chaining multiple pieces of specialist knowledge together. A physics question might require combining quantum field theory with statistical mechanics in a way that even PhD students find challenging. Models often get the individual pieces right but fail to connect them.

Novel problem structures

Unlike benchmarks with well-known problem formats, HLE includes question types models haven't been trained to handle. This tests genuine adaptation rather than pattern matching.

HLE vs other knowledge benchmarks

To understand HLE's value, compare it with the other major knowledge benchmarks:

Benchmark  | Top Score | Score Spread | Saturation Risk | Best For
MMLU       | 93        | 70-93        | High            | General knowledge baseline
MMLU-Pro   | 87        | 50-87        | Medium          | Harder multiple choice
GPQA       | 97        | 80-97        | High            | PhD-level science (3 domains)
SuperGPQA  | 95        | 55-95        | Medium          | PhD-level (285 domains)
HLE        | 46        | 10-46        | Very low        | Frontier reasoning

HLE is the only benchmark where no model has crossed 50%. This means it will remain useful for differentiating models for years — possibly the entire next generation of AI development.
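
One way to make the saturation argument concrete is to compute, for each benchmark, how much headroom remains above the current top score and how widely the models are spread. The sketch below just re-derives those two numbers from the table above; the figures are the article's, not live data.

```python
# Headroom and spread per benchmark, using the figures from the table above.
# (top, low, high) = best current score, bottom and top of the score spread.
benchmarks = {
    "MMLU":      (93, 70, 93),
    "MMLU-Pro":  (87, 50, 87),
    "GPQA":      (97, 80, 97),
    "SuperGPQA": (95, 55, 95),
    "HLE":       (46, 10, 46),
}

for name, (top, low, high) in benchmarks.items():
    headroom = 100 - top   # points left before the benchmark saturates
    spread = high - low    # how well the benchmark separates models
    print(f"{name:<10} headroom {headroom:>2}  spread {spread:>2}")

# HLE is the only row with both large headroom (54 points) and a large
# spread (36 points), which is why it still differentiates frontier models.
```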

When to use HLE for evaluation

HLE is most useful when:

  • Comparing frontier models — it's the best benchmark for seeing real differences between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro
  • Tracking AI progress over time — scores are far from saturation, so improvements will be visible for years
  • Assessing deep reasoning — if your use case requires PhD-level scientific knowledge, HLE scores are the best predictor
  • Evaluating reasoning models — models with chain-of-thought capabilities (o3, o4-mini, DeepSeek R1) tend to show larger improvements on HLE than non-reasoning models

HLE is less useful for:

  • Evaluating mid-tier models — most non-frontier models score in the single digits, so there's not enough variance to differentiate them
  • Practical task prediction — a score of 38 vs 46 on frontier expert questions may not predict performance on typical business or consumer tasks
  • Domain-specific evaluation — HLE covers many domains, so you can't isolate performance in any specific field. Use GPQA for science-specific evaluation.

What HLE scores mean for AI capabilities

The fact that the best AI scores below 50% on HLE is often cited as evidence that AI has "a long way to go." But context matters:

  • Human experts score about 74% on HLE questions in their own domain of expertise. On questions outside their specialty, experts score much lower — sometimes below the models.
  • The benchmark is designed to be hard for everyone, including humans. A model scoring 46% is performing at roughly the level of a smart generalist researcher across all domains simultaneously.
  • Progress has been rapid: early 2025 frontier models scored around 10-15% on HLE. By early 2026, scores had roughly tripled to the mid-40s. If this trajectory continues, models could cross 70% by late 2026 or early 2027 (see the back-of-the-envelope extrapolation below).
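
To see where the "70% by late 2026 or early 2027" figure comes from, here is a straight-line extrapolation from the two data points cited above (roughly 12% in early 2025 and 46% in early 2026). It is purely illustrative: benchmark progress rarely stays linear, and the input numbers are approximations.

```python
# Back-of-the-envelope extrapolation of top HLE scores.
# Two approximate data points from the article: ~12% in early 2025, ~46% in early 2026.
points = [(2025.0, 12.0), (2026.0, 46.0)]

(t0, s0), (t1, s1) = points
slope = (s1 - s0) / (t1 - t0)  # ~34 points of improvement per year

def projected_score(year: float) -> float:
    """Linear projection from the latest data point, capped at 100 (illustrative only)."""
    return min(100.0, s1 + slope * (year - t1))

for year in (2026.5, 2026.75, 2027.0):
    print(f"{year:.2f}: ~{projected_score(year):.0f}%")
# On this naive straight line, 70% is crossed around 2026.7,
# i.e. late 2026, consistent with the projection in the text.
```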

The bottom line

HLE is the single most important benchmark for tracking AI progress at the frontier. While other benchmarks have become checkboxes (score 90+ and move on), HLE still has room to differentiate. With every new frontier model release, the first question is: "What did it score on HLE?"

For broader knowledge assessment, pair HLE with MMLU-Pro and GPQA. For a complete picture including coding and math, see our overall model rankings or use the LLM Selector Quiz to find the best model for your specific needs.


Data from BenchLM.ai. Last updated March 2026.
