A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, docstring, body, and several unit tests.
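By way of illustration, a problem in this style might look roughly like the sketch below. The task, function name, and tests are hypothetical, written here only to show the shape of a problem (signature plus docstring as the prompt, a reference body, and hidden unit tests); it is not an item from the actual benchmark set.

```python
# Hypothetical problem in the HumanEval style (not an actual benchmark item).

def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u, case-insensitive) in text.

    >>> count_vowels("Hello World")
    3
    >>> count_vowels("xyz")
    0
    """
    return sum(1 for ch in text.lower() if ch in "aeiou")


# Each problem ships with unit tests in roughly this shape; a model's
# completion counts as correct only if every assertion passes.
def check(candidate):
    assert candidate("Hello World") == 3
    assert candidate("xyz") == 0
    assert candidate("") == 0
    assert candidate("AEIOU") == 5


check(count_vowels)
```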
BenchLM mirrors the published score view for HumanEval. DeepSeek V4 Pro Base leads the public snapshot at 76.8%, followed by DeepSeek V4 Flash Base at 69.5%. BenchLM does not use these results to rank models overall.
DeepSeek V4 Pro Base (DeepSeek): 76.8%
DeepSeek V4 Flash Base (DeepSeek): 69.5%
Year: 2021
Tasks: 164 problems
Format: Python function generation
Difficulty: Introductory to intermediate programming
HumanEval measures functional correctness for synthesizing programs from docstrings. It focuses on whether generated code actually works correctly rather than just looking syntactically correct. Problems range from simple string manipulation to more complex algorithmic challenges.
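A minimal sketch of what functional-correctness checking amounts to is shown below, assuming a problem is stored as a prompt (signature plus docstring), a model-generated completion, and a test function. This is illustrative only, not the official HumanEval evaluation harness, which sandboxes execution and aggregates pass rates over many sampled completions.

```python
# Illustrative functional-correctness check: assemble prompt + completion +
# tests into one program and see whether it runs without assertion failures.

PROMPT = '''
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
'''

# A hypothetical model completion for the prompt above.
COMPLETION = "    return s == s[::-1]\n"

TESTS = '''
def check(candidate):
    assert candidate("level") is True
    assert candidate("python") is False
    assert candidate("") is True
'''


def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Return True if the assembled program runs its tests without failing."""
    program = prompt + completion + "\n" + tests + "\ncheck(is_palindrome)\n"
    namespace = {}
    try:
        exec(program, namespace)  # real harnesses sandbox this step
        return True
    except Exception:
        return False


print(passes_tests(PROMPT, COMPLETION, TESTS))  # True for this completion
```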
Version: HumanEval
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
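As a rough illustration of that kind of policy, the sketch below maps freshness metadata to the three tiers named above. The tier names come from this page, but the field values and thresholds are assumptions for illustration, not BenchLM's actual rules; the methodology page is the authoritative reference.

```python
# Hypothetical mapping from freshness metadata to a display tier.
# Field values like "Fresh" and "Aging" are assumed, not BenchLM's schema.

def display_tier(refresh_cadence: str, staleness_state: str) -> str:
    """Map a benchmark's freshness metadata to a display tier."""
    if staleness_state == "Fresh" and refresh_cadence != "Static":
        return "strong differentiator"
    if staleness_state == "Aging":
        return "benchmark to watch"
    return "display-only reference"


# HumanEval is static and marked stale, so it lands in the last tier.
print(display_tier("Static", "Stale"))  # display-only reference
```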
DeepSeek V4 Pro Base by DeepSeek currently leads with a score of 76.8% on HumanEval.
Two AI models have been evaluated on HumanEval on BenchLM.