
Evaluating Large Language Models Trained on Code (HumanEval)

A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, a docstring, a body, and several unit tests.

How BenchLM shows HumanEval right now

BenchLM is tracking HumanEval in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

119 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on HumanEval — April 20, 2026

BenchLM mirrors the published tracked score view for HumanEval. Kimi K2.5 (Reasoning) leads the public snapshot at 99%, followed by Kimi K2.5 (99%) and GPT-5.2-Codex (95%). BenchLM does not use these results to rank models overall.

119 models · Coding · Stale · Saturated · Display only · Updated April 20, 2026

The published HumanEval snapshot is tightly clustered at the top: Kimi K2.5 (Reasoning) sits at 99%, while the third row is only 4.0 points behind. The broader top-10 spread is 6.7 points, so many of the published scores sit in a relatively narrow band.
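As a quick sanity check on those figures, here is a minimal Python sketch that recomputes the gap behind the leader and the top-10 spread from the tracked scores listed in the table below:

```python
# Top-10 tracked HumanEval scores from the table below, in percent.
top10 = [99.0, 99.0, 95.0, 95.0, 95.0, 94.2, 94.0, 93.0, 93.0, 92.3]

leader = top10[0]
gap_to_third = leader - top10[2]   # 99.0 - 95.0 = 4.0 points
top10_spread = leader - top10[-1]  # 99.0 - 92.3 = 6.7 points

print(f"Gap to third row: {gap_to_third:.1f} points")
print(f"Top-10 spread: {top10_spread:.1f} points")
```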

119 models have been evaluated on HumanEval. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. HumanEval is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
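BenchLM's exact aggregation formula is not given on this page, but the effect of excluding a display-only benchmark can be illustrated with a minimal sketch. The helper name, result fields, and the assumption of a simple category-weighted average are placeholders here, not BenchLM's actual implementation; only the 20% Coding weight comes from the text above.

```python
# Illustrative sketch only: assumes a simple category-weighted average in
# which display-only benchmarks (such as HumanEval here) are skipped.
CATEGORY_WEIGHTS = {"Coding": 0.20}  # other category weights omitted for brevity

def overall_score(results):
    """results: list of dicts, e.g.
    {"benchmark": "HumanEval", "category": "Coding", "score": 99.0, "display_only": True}
    Returns a weighted average over scoring-eligible benchmarks only."""
    total = weight_sum = 0.0
    for r in results:
        if r["display_only"]:
            continue  # display-only rows never enter the aggregation
        w = CATEGORY_WEIGHTS.get(r["category"], 0.0)
        total += w * r["score"]
        weight_sum += w
    return total / weight_sum if weight_sum else None
```

Because the display-only rows are filtered out before aggregation, a model's HumanEval score cannot move its overall ranking, which matches the behavior described above.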

About HumanEval

Year: 2021
Tasks: 164 problems
Format: Python function generation
Difficulty: Introductory to intermediate programming

HumanEval measures functional correctness for synthesizing programs from docstrings. It focuses on whether generated code actually works correctly rather than just looking syntactically correct. Problems range from simple string manipulation to more complex algorithmic challenges.
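To make the task format concrete, here is an illustrative problem written in the HumanEval style. It is a made-up example rather than one of the 164 dataset entries, but it shows the signature-plus-docstring prompt the model sees and the unit tests used to judge functional correctness.

```python
# Illustrative only: a made-up problem in the HumanEval format, not an
# actual dataset entry. The prompt is the signature plus docstring; the
# model generates the body, which is then run against hidden unit tests.

def sum_of_evens(numbers):
    """Return the sum of the even integers in the list.
    >>> sum_of_evens([1, 2, 3, 4])
    6
    >>> sum_of_evens([1, 3, 5])
    0
    """
    # A model-generated body would go here; a reference solution:
    return sum(n for n in numbers if n % 2 == 0)


def check(candidate):
    # Functional correctness: the completion passes only if every assert holds.
    assert candidate([1, 2, 3, 4]) == 6
    assert candidate([1, 3, 5]) == 0
    assert candidate([]) == 0
    assert candidate([2, 4, 6]) == 12


check(sum_of_evens)
```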

BenchLM freshness & provenance

Version: HumanEval
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

Stale · Saturated · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
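As an illustration of how freshness metadata could feed a policy like this, here is a minimal sketch. The state names (including "Aging"), the rule ordering, and the function itself are placeholders and not BenchLM's actual rules; see the methodology page for those.

```python
# Placeholder rules only, for illustration of the three roles named above.
def benchmark_role(staleness: str, saturated: bool, in_scoring_formula: bool) -> str:
    """Map freshness and saturation metadata to how a benchmark is treated."""
    if not in_scoring_formula or staleness == "Stale" or saturated:
        return "display-only reference"   # e.g. HumanEval: Stale, Saturated, excluded
    if staleness == "Aging":              # hypothetical intermediate state
        return "benchmark to watch"
    return "strong differentiator"

print(benchmark_role("Stale", saturated=True, in_scoring_formula=False))
# -> display-only reference
```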

Tracked score table (119 models)

Rank | Model | Slug | Tracked score
1 | Kimi K2.5 (Reasoning) | kimi-k2-5-reasoning | 99%
2 | Kimi K2.5 | kimi-k2-5 | 99%
3 | GPT-5.2-Codex | gpt-5-2-codex | 95%
4 | GPT-5.3 Codex | gpt-5-3-codex | 95%
5 | GPT-5.4 | gpt-5-4 | 95%
6 | GLM-4.7 | glm-4-7 | 94.2%
7 | GPT-5.1-Codex-Max | gpt-5-1-codex-max | 94%
8 | Claude Sonnet 4.6 | claude-sonnet-4-6 | 93%
9 | GPT-5.2 Pro | gpt-5-2-pro | 93%
10 | Mistral Large 3 | mistral-large-3 | 92.3%
11 | Mistral Medium 3 | mistral-medium-3 | 92.1%
12 | Sarvam 30B | sarvam-30b | 92.1%
13 | Claude 3.5 Sonnet | claude-3-5-sonnet | 92%
14 | DeepSeek-R1 | deepseek-r1 | 92%
15 | Qwen2.5-VL-32B | qwen2-5-vl-32b | 91.5%
16 | Grok 4.1 | grok-4-1 | 91%
17 | Gemini 3 Pro Deep Think | gemini-3-pro-deep-think | 91%
18 | Gemini 3.1 Pro | gemini-3-1-pro | 91%
19 | Claude Opus 4.6 | claude-opus-4-6 | 91%
20 | GPT-5.2 | gpt-5-2 | 91%
21 | Gemini 3 Pro | gemini-3-pro | 91%
22 | Claude Opus 4.5 | claude-opus-4-5 | 91%
23 | GPT-5.3-Codex-Spark | gpt-5-3-codex-spark | 91%
24 | GPT-5.1 | gpt-5-1 | 89%
25 | GLM-5 (Reasoning) | glm-5-reasoning | 88%
26 | GPT-5.3 Instant | gpt-5-3-instant | 88%
27 | GPT-4o mini | gpt-4o-mini | 87.2%
28 | Claude Sonnet 4.5 | claude-sonnet-4-5 | 87%
29 | GPT-5.2 Instant | gpt-5-2-instant | 87%
30 |  |  | 86%
31 | Grok 4.1 Fast | grok-4-1-fast | 86%
32 | GPT-5 (high) | gpt-5-high | 85%
33 | Claude 3 Opus | claude-3-opus | 84.9%
34 | MiMo-V2-Flash | mimo-v2-flash | 84.8%
35 | Mistral Small 4 | mistral-small-4 | 84.8%
36 | GPT-5 (medium) | gpt-5-medium | 83%
37 | Qwen3.5 397B (Reasoning) | qwen3-5-397b-reasoning | 83%
38 | Phi-4 | phi-4 | 82.6%
39 | DeepSeek Coder 2.0 | deepseek-coder-2-0 | 82%
40 |  |  | 80%
41 | GLM-5 | glm-5 | 80%
42 | GPT-5 mini | gpt-5-mini | 80%
43 | Grok 4 | grok-4 | 79%
44 | DeepSeek V3.2 (Thinking) | deepseek-v3-2-thinking | 79%
45 | Ministral 3 14B (Reasoning) | ministral-3-14b-reasoning | 78.5%
46 |  |  | 78%
47 | Step 3.5 Flash | step-3-5-flash | 77%
48 | Qwen2.5-1M | qwen2-5-1m | 76%
49 | DeepSeek V3.2 | deepseek-v3-2 | 76%
50 | Qwen3.5 397B | qwen3-5-397b | 75%
51 | Gemini 2.5 Pro | gemini-2-5-pro | 75%
52 | Qwen2.5-72B | qwen2-5-72b | 75%
53 | Mercury 2 | mercury-2 | 75%
54 | o4-mini (high) | o4-mini-high | 74%
55 | Granite-4.0-H-1B | granite-4-0-h-1b | 74%
56 | DeepSeek LLM 2.0 | deepseek-llm-2-0 | 73%
57 | Claude 3 Haiku | claude-3-haiku | 73%
58 | Granite-4.0-1B | granite-4-0-1b | 73%
59 | DeepSeekMath V2 | deepseekmath-v2 | 72%
60 | DBRX Instruct | dbrx-instruct | 70.1%
61 | Claude 4.1 Opus | claude-4-1-opus | 68%
62 | Claude 4.1 Opus Thinking | claude-4-1-opus-thinking | 68%
63 | Nemotron 3 Ultra 500B | nemotron-3-ultra-500b | 66%
64 | Aion-2.0 | aion-2-0 | 66%
65 | Claude 4 Sonnet | claude-4-sonnet | 65%
66 | MiniMax M2.5 | minimax-m2-5 | 65%
67 | Seed 1.6 | seed-1-6 | 64%
68 | Seed-2.0-Lite | seed-2-0-lite | 63%
69 | Gemini 3 Flash | gemini-3-flash | 62%
70 | Llama 3.1 405B | llama-3-1-405b | 62%
71 | Claude Haiku 4.5 | claude-haiku-4-5 | 60%
72 | Grok Code Fast 1 | grok-code-fast-1 | 60%
73 | Mistral Large 2 | mistral-large-2 | 60%
74 | Nemotron 3 Super 120B A12B | nemotron-3-super-120b-a12b | 59%
75 | Seed 1.6 Flash | seed-1-6-flash | 59%
76 | GPT-4o | gpt-4o | 58%
77 | GLM-4.7-Flash | glm-4-7-flash | 58%
78 | Ministral 3 14B | ministral-3-14b | 58%
79 | Nemotron 3 Super 100B | nemotron-3-super-100b | 57%
80 | Gemini 1.5 Pro | gemini-1-5-pro | 56%
81 | Gemini 3.1 Flash-Lite | gemini-3-1-flash-lite | 55%
82 | Seed-2.0-Mini | seed-2-0-mini | 55%
83 | Mixtral 8x22B Instruct v0.1 | mixtral-8x22b-instruct-v0-1 | 54.8%
84 | Gemini 1.0 Pro | gemini-1-0-pro | 54%
85 | GPT-4 Turbo | gpt-4-turbo | 52%
86 | Llama 3 70B | llama-3-70b | 50%
87 | Nemotron 3 Nano 30B | nemotron-3-nano-30b | 49%
88 | Nemotron-4 15B | nemotron-4-15b | 46%
89 | Moonshot v1 | moonshot-v1 | 45%
90 | Z-1 | z-1 | 44%
91 | GPT-OSS 120B | gpt-oss-120b | 43%
92 | Gemini 2.5 Flash | gemini-2-5-flash | 42%
93 | LFM2-24B-A2B | lfm2-24b-a2b | 42%
94 | Nemotron Ultra 253B | nemotron-ultra-253b | 41%
95 | Llama 4 Behemoth | llama-4-behemoth | 40%
96 | Llama 4 Scout | llama-4-scout | 39%
97 | Granite-4.0-H-350M | granite-4-0-h-350m | 39%
98 | Llama 4 Maverick | llama-4-maverick | 38%
99 | Granite-4.0-350M | granite-4-0-350m | 38%
100 | Gemma 3 27B | gemma-3-27b | 37%
101 | Grok 3 [Beta] | grok-3-beta | 34%
102 | Nova Pro | nova-pro | 33%
103 | Mistral 8x7B | mistral-8x7b | 32.3%
104 | Qwen3 235B 2507 (Reasoning) | qwen3-235b-2507-reasoning | 32%
105 | Qwen3 235B 2507 | qwen3-235b-2507 | 31%
106 | Mistral 7B v0.3 | mistral-7b-v0-3 | 30.5%
107 | GLM-4.5 | glm-4-5 | 29%
108 | MiniMax M1 80k | minimax-m1-80k | 28%
109 | GLM-4.5-Air | glm-4-5-air | 27%
110 | DeepSeek V3.1 (Reasoning) | deepseek-v3-1-reasoning | 26%
111 | DeepSeek V3.1 | deepseek-v3-1 | 25%
112 | Ministral 3 8B (Reasoning) | ministral-3-8b-reasoning | 24%
113 | GPT-OSS 20B | gpt-oss-20b | 23%
114 | Ministral 3 8B | ministral-3-8b | 23%
115 | Mistral 8x7B v0.2 | mistral-8x7b-v0-2 | 21%
116 | LFM2.5-1.2B-Thinking | lfm2-5-1-2b-thinking | 17%
117 | Ministral 3 3B (Reasoning) | ministral-3-3b-reasoning | 16%
118 | Ministral 3 3B | ministral-3-3b | 15%
119 | LFM2.5-1.2B-Instruct | lfm2-5-1-2b-instruct | 14%

FAQ

What does HumanEval measure?

HumanEval measures the ability to generate correct Python functions from natural language descriptions, using a set of 164 handwritten programming problems. Each problem includes a function signature, a docstring, a body, and several unit tests, and a completion counts as correct only if the generated code passes the tests.

Which model leads the published HumanEval snapshot?

Kimi K2.5 (Reasoning) currently leads the published HumanEval snapshot with a tracked score of 99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on HumanEval?

119 AI models are included in BenchLM's mirrored HumanEval snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard
