Evaluating Large Language Models Trained on Code (HumanEval)

HumanEval is a set of 164 handwritten programming problems that test a model's ability to generate correct Python functions from natural-language descriptions. Each problem includes a function signature, docstring, reference body, and several unit tests.
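Each task follows the same shape. The problem below is illustrative (it is not taken from the benchmark itself), but it mirrors the format: the signature and docstring are given as the prompt, the model supplies the body, and a test function decides correctness:

```python
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in the string s.

    >>> count_vowels("hello")
    2
    """
    # A model-generated body; it passes only if every test below agrees.
    return sum(1 for ch in s.lower() if ch in "aeiou")


def check(candidate):
    # HumanEval ships hidden unit tests of this form for each task.
    assert candidate("hello") == 2
    assert candidate("xyz") == 0
    assert candidate("AEIOU") == 5


check(count_vowels)
```

A completion is scored purely by running these tests; stylistic quality of the code plays no role.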

About HumanEval

Year: 2021
Tasks: 164 problems
Format: Python function generation
Difficulty: Introductory to intermediate programming

HumanEval measures functional correctness for synthesizing programs from docstrings: generated code must actually pass the unit tests rather than merely look syntactically plausible. Results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes all tests. Problems range from simple string manipulation to more complex algorithmic challenges.
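Scores like those on the leaderboard below are typically pass@k values. The original HumanEval paper derives an unbiased, numerically stable estimator, pass@k = 1 - C(n-c, k)/C(n, k), for n samples per problem of which c pass; a minimal sketch (the function name is mine):

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n completions sampled per problem,
    c of them correct, k the sampling budget being scored."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # pass@k = 1 - C(n-c, k) / C(n, k), evaluated as a stable product
    # to avoid overflow in the binomial coefficients.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

For example, with 4 samples of which 2 are correct, pass@2 is 1 - C(2,2)/C(4,2) = 5/6; a leaderboard score of 95 corresponds to an average pass rate of 0.95 across the 164 problems.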

Leaderboard (88 models)

#1 GPT-5.3 Codex: 95
#2 GPT-5.2-Codex: 95
#4 Claude Sonnet 4.6: 93
#5 GPT-5.4: 91
#6 Gemini 3.1 Pro: 91
#7 Claude Opus 4.6: 91
#8 Grok 4.1: 91
#9 GPT-5.2: 91
#11 Claude Opus 4.5: 91
#12 Gemini 3 Pro: 91
#13 GPT-5.1: 89
#14 GLM-5 (Reasoning): 88
#15 Claude Sonnet 4.5: 87
#16 o1-preview: 86
#18 GPT-5 (high): 85
#19 Kimi K2.5 (Reasoning): 84
#20 GPT-5 (medium): 83
#22 DeepSeek Coder 2.0: 82
#23 GPT-5 mini: 80
#24 o3-pro: 80
#25 GLM-5: 80
#26 Grok 4: 79
#28 o3: 78
#29 GLM-4.7: 78
#30 Qwen2.5-1M: 76
#31 DeepSeek V3.2: 76
#32 Qwen2.5-72B: 75
#33 Gemini 2.5 Pro: 75
#34 Qwen3.5 397B: 75
#35 o4-mini (high): 74
#36 DeepSeek LLM 2.0: 73
#37 DeepSeekMath V2: 72
#38 MiMo-V2-Flash: 71
#39 Kimi K2.5: 69
#40 Claude 4.1 Opus: 68
#41 Mistral Large 3: 68
#43 Claude 4 Sonnet: 65
#44 MiniMax M2.5: 65
#46 Gemini 3 Flash: 62
#47 Mistral Large 2: 60
#48 Claude Haiku 4.5: 60
#50 GPT-4o: 58
#51 GLM-4.7-Flash: 58
#52 Claude 3.5 Sonnet: 57
#54 Gemini 1.5 Pro: 56
#55 Mistral 8x7B: 55
#57 Gemini 1.0 Pro: 54
#58 Claude 3 Opus: 53
#59 GPT-4 Turbo: 52
#60 Llama 3 70B: 50
#62 Claude 3 Haiku: 48
#63 Nemotron-4 15B: 46
#64 Moonshot v1: 45
#65 Z-1: 44
#66 GPT-OSS 120B: 43
#67 Gemini 2.5 Flash: 42
#70 Llama 4 Scout: 39
#72 Gemma 3 27B: 37
#73 DeepSeek-R1: 36
#74 Qwen2.5-VL-32B: 35
#76 Nova Pro: 33
#78 Qwen3 235B 2507: 31
#80 GLM-4.5: 29
#81 MiniMax M1 80k: 28
#82 GLM-4.5-Air: 27
#84 DeepSeek V3.1: 25
#85 Kimi K2: 24
#86 GPT-OSS 20B: 23
#87 Mistral 7B v0.3: 22
#88 Mistral 8x7B v0.2: 21

FAQ

What does HumanEval measure?

HumanEval measures the functional correctness of model-generated Python code. Each of its 164 handwritten problems supplies a function signature and docstring, and a completion counts as correct only if it passes the problem's unit tests.

Which model scores highest on HumanEval?

GPT-5.3 Codex by OpenAI currently leads with a score of 95 on HumanEval.

How many models are evaluated on HumanEval?

88 AI models have been evaluated on HumanEval on BenchLM.