Humanity's Last Exam (HLE)

An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

About HLE

Year: 2025
Tasks: Expert-level questions
Format: Open-ended and multiple choice
Difficulty: Frontier expert level

HLE is among the hardest public benchmarks available, with even the top-ranked models scoring below 50%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.

Leaderboard (88 models)

Rank Model                  Score
#1   GPT-5.4                46
#2   GPT-5.3 Codex          44
#3   GPT-5.2                42
#4   Gemini 3.1 Pro         40
#5   Grok 4.1               40
#6   Claude Opus 4.6        38
#8   o1-preview             32
#9   GLM-5 (Reasoning)      29
#12  GPT-5.1                27
#13  GPT-5 (high)           27
#14  GPT-5 (medium)         27
#15  Kimi K2.5 (Reasoning)  27
#16  GPT-5.2-Codex          26
#17  o3-pro                 26
#18  o3                     24
#20  Claude Sonnet 4.6      21
#21  Claude Sonnet 4.5      21
#22  Claude Opus 4.5        20
#23  Gemini 3 Pro           20
#25  DeepSeekMath V2        18
#26  GPT-5 mini             16
#27  Grok 4                 16
#28  GLM-4.7                16
#30  GLM-4.7-Flash          15
#31  DeepSeek Coder 2.0     14
#32  MiMo-V2-Flash          14
#33  DeepSeek-R1            14
#34  GLM-5                  13
#35  o4-mini (high)         13
#37  DeepSeek LLM 2.0       12
#38  Mistral Large 3        12
#39  Claude 4 Sonnet        12
#40  Mistral Large 2        12
#41  Qwen2.5-72B            11
#42  DeepSeek V3.2          11
#43  Claude 4.1 Opus        11
#44  Kimi K2.5              11
#45  Claude Haiku 4.5       11
#47  Qwen2.5-1M             10
#48  Qwen3.5 397B           10
#49  MiniMax M2.5           10
#51  Mistral 8x7B           8
#55  Gemini 3 Flash         6
#56  Z-1                    6
#58  Claude 3.5 Sonnet      5
#59  Nemotron-4 15B         5
#60  Moonshot v1            5
#61  GPT-OSS 120B           5
#62  Mistral 7B v0.3        5
#64  Nova Pro               4
#65  GLM-4.5-Air            4
#66  Gemini 2.5 Pro         3
#68  Gemma 3 27B            3
#70  GLM-4.5                3
#71  Kimi K2                3
#73  Llama 3 70B            2
#74  Claude 3 Haiku         2
#76  Qwen2.5-VL-32B         2
#77  MiniMax M1 80k         2
#78  DeepSeek V3.1          2
#79  GPT-4o                 1
#80  Gemini 1.5 Pro         1
#82  Claude 3 Opus          1
#83  GPT-4 Turbo            1
#84  Gemini 1.0 Pro         1
#87  Qwen3 235B 2507        1
#88  GPT-OSS 20B            1

FAQ

What does HLE measure?

HLE measures how well AI models answer expert-level questions at the frontier of human knowledge. The questions are crowd-sourced from thousands of domain experts worldwide and are difficult even for specialists.

Which model scores highest on HLE?

GPT-5.4 by OpenAI currently leads with a score of 46 on HLE.

How many models are evaluated on HLE?

BenchLM lists HLE results for 88 AI models.