A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, docstring, body, and several unit tests.
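To make the task format concrete, the sketch below shows a toy problem in the HumanEval style. It is an invented example, not one of the 164 actual problems: the signature and docstring form the prompt, the body is what the model is expected to generate, and the check function stands in for the per-problem unit tests.

```python
# Toy example of the HumanEval task shape; an invented problem, not one of
# the 164 real tasks. The signature and docstring are the prompt; the body
# is what the model is asked to generate.

def sum_of_squares(nums: list[int]) -> int:
    """Return the sum of the squares of the integers in nums.

    >>> sum_of_squares([1, 2, 3])
    14
    """
    return sum(n * n for n in nums)


def check(candidate):
    # Stand-in for the per-problem unit tests: the generated body is
    # accepted only if every assertion passes.
    assert candidate([1, 2, 3]) == 14
    assert candidate([]) == 0
    assert candidate([-2, 5]) == 29


check(sum_of_squares)
```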
BenchLM is tracking HumanEval in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.
These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.
BenchLM mirrors the published tracked score view for HumanEval. Kimi K2.5 (Reasoning) leads the public snapshot at 99%, followed by Kimi K2.5 (99%) and GPT-5.2-Codex (95%). BenchLM does not use these results to rank models overall.
1. Kimi K2.5 (Reasoning), Moonshot AI (kimi-k2-5-reasoning): 99%
2. Kimi K2.5, Moonshot AI (kimi-k2-5): 99%
3. GPT-5.2-Codex, OpenAI (gpt-5-2-codex): 95%
The published HumanEval snapshot is tightly clustered at the top: Kimi K2.5 (Reasoning) sits at 99%, while the third row is only 4.0 points behind. The broader top-10 spread is 6.7 points, so many of the published scores sit in a relatively narrow band.
119 models have been evaluated on HumanEval. The benchmark falls in BenchLM.ai's Coding category, which carries a 20% weight in the overall scoring system. However, HumanEval is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2021
Tasks: 164 problems
Format: Python function generation
Difficulty: Introductory to intermediate programming
HumanEval measures functional correctness for synthesizing programs from docstrings. It focuses on whether generated code actually works correctly rather than just looking syntactically correct. Problems range from simple string manipulation to more complex algorithmic challenges.
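Scores on HumanEval are conventionally reported as pass@k: the probability that at least one of k sampled completions passes all of a problem's unit tests. The minimal sketch below implements the standard unbiased estimator for this metric; the sample counts in the example are illustrative, not taken from any model on this page.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n sampled
    completions of which c pass all unit tests."""
    if n - c < k:
        return 1.0  # any k-sample draw must contain a correct completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples per problem, 120 passing.
print(pass_at_k(200, 120, 1))   # 0.6
print(pass_at_k(200, 120, 10))  # close to 1.0
```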
Version: HumanEval
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
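As a purely hypothetical illustration of how freshness metadata could feed such a tiering decision (the function name, tier labels, and rules below are invented for this sketch and are not BenchLM's published policy):

```python
# Hypothetical sketch only: the tier labels, states, and rules are invented
# for illustration and are not BenchLM's published methodology.

def display_tier(refresh_cadence: str, staleness_state: str) -> str:
    """Map a benchmark's freshness metadata to how it is treated."""
    if staleness_state == "Stale" or refresh_cadence == "Static":
        return "display-only reference"   # e.g. how HumanEval appears here
    if staleness_state == "Aging":
        return "benchmark to watch"
    return "strong differentiator"

print(display_tier("Static", "Stale"))  # display-only reference
```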
Kimi K2.5 (Reasoning) currently leads the published HumanEval snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
119 AI models are included in BenchLM's mirrored HumanEval snapshot, based on the public leaderboard captured on April 20, 2026.