A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, a docstring, a reference body, and several unit tests.
Year
2021
Tasks
164 problems
Format
Python function generation
Difficulty
Introductory to intermediate programming
HumanEval measures functional correctness for synthesizing programs from docstrings: a completion counts as correct only if it passes the problem's unit tests, rather than merely resembling correct code. Problems range from simple string manipulation to more complex algorithmic challenges.
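To make the format concrete, here is a sketch of a HumanEval-style task, modeled loosely on the benchmark's first problem. The model is given the signature and docstring and must produce the body; a `check` function of unit tests then judges functional correctness. The exact docstring wording and test values here are illustrative, not copied from the benchmark.

```python
# A HumanEval-style task: signature + docstring form the prompt,
# and the model must generate the function body.
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # A candidate completion (this body is what the model generates):
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


def check(candidate):
    # Unit tests decide pass/fail; surface form of the code is irrelevant.
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0], 0.5) is True


check(has_close_elements)
```

A completion passes only if every assertion in `check` holds, which is what "functional correctness" means here.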
Evaluating Large Language Models Trained on Code
GPT-5.3 Codex by OpenAI currently leads with a score of 95 on HumanEval.
88 AI models have been evaluated on HumanEval on BenchLM.