Evaluating Large Language Models Trained on Code (HumanEval)

A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, docstring, body, and several unit tests.
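For illustration, a problem in this style looks roughly like the sketch below (the function, body, and tests here are illustrative, not copied verbatim from the dataset). The model sees the signature and docstring and must synthesize a body that passes the hidden unit tests.

    from typing import List


    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        """Return True if any two numbers in the list are closer to
        each other than the given threshold."""
        # A model-generated body might look like this:
        for i, a in enumerate(numbers):
            for b in numbers[i + 1:]:
                if abs(a - b) < threshold:
                    return True
        return False


    # Grading executes the completed function against unit tests:
    def check(candidate):
        assert candidate([1.0, 2.0, 3.0], 0.5) is False
        assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True


    check(has_close_elements)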

Benchmark score on HumanEval — May 13, 2026

BenchLM mirrors the published score view for HumanEval. DeepSeek V4 Pro Base leads the public snapshot at 76.8%, followed by DeepSeek V4 Flash Base (69.5%). BenchLM does not use these results to rank models overall.

2 models · Coding · Stale · Saturated · Display only · Updated May 13, 2026

About HumanEval

Year

2021

Tasks

164 problems

Format

Python function generation

Difficulty

Introductory to intermediate programming

HumanEval measures functional correctness for synthesizing programs from docstrings. It checks whether generated code actually works, as judged by unit tests, rather than whether it merely looks syntactically plausible. Problems range from simple string manipulation to more complex algorithmic challenges.
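A problem counts as solved only if the generated body passes every unit test. The metric the HumanEval paper reports is pass@k: sample n completions per problem, count the c that pass, and estimate the probability that at least one of k randomly chosen samples would have passed. Below is a minimal sketch of the paper's unbiased estimator; note that leaderboard percentages like those above are commonly pass@1, though this page does not state the sampling setup.

    import math


    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k samples,
        drawn without replacement from n total (c of them correct),
        passes all unit tests. Equals 1 - C(n-c, k) / C(n, k)."""
        if n - c < k:
            return 1.0  # fewer than k failures, so some sample always passes
        return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


    # With k = 1 the estimator reduces to the plain pass rate c / n:
    print(pass_at_k(n=200, c=130, k=1))   # 0.65
    print(pass_at_k(n=200, c=130, k=10))  # ~0.99998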

BenchLM freshness & provenance

Version

HumanEval

Refresh cadence

Static

Staleness state

Stale

Question availability

Public benchmark set

Stale · Saturated · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
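Purely as a hypothetical sketch (the actual rules live on the methodology page and are not reproduced here), the decision could reduce to a mapping like the one below, which at least matches the flags shown on this page.

    from enum import Enum


    class Role(Enum):
        STRONG_DIFFERENTIATOR = "strong differentiator"
        WATCH = "benchmark to watch"
        DISPLAY_ONLY = "display-only reference"


    def classify(stale: bool, saturated: bool) -> Role:
        # Hypothetical rule, not BenchLM's actual policy: a benchmark
        # that is both stale and saturated no longer separates frontier
        # models, so it is kept for reference only.
        if stale and saturated:
            return Role.DISPLAY_ONLY
        if stale or saturated:
            return Role.WATCH
        return Role.STRONG_DIFFERENTIATOR


    # HumanEval is flagged both Stale and Saturated above:
    assert classify(stale=True, saturated=True) is Role.DISPLAY_ONLY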

Benchmark score table (2 models)

Rank  Model                    Score
1     DeepSeek V4 Pro Base     76.8%
2     DeepSeek V4 Flash Base   69.5%

FAQ

What does HumanEval measure?

HumanEval measures the ability to generate correct Python functions from natural language descriptions, using a set of 164 handwritten programming problems. Each problem includes a function signature, docstring, body, and several unit tests.

Which model scores highest on HumanEval?

DeepSeek V4 Pro Base by DeepSeek currently leads with a score of 76.8% on HumanEval.

How many models are evaluated on HumanEval?

Two AI models have been evaluated on HumanEval on BenchLM.

Compare Top Models on HumanEval

Last updated: May 13, 2026 · Benchmark version: HumanEval

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.