Coding Benchmarks

Programming and software development: compare AI models across programming benchmarks, including HumanEval, CodeContest, and more.


Coding Benchmark Results

Showing 15 of 20 models.

| Rank | Model | Creator | License | Score | HumanEval |
|------|-------|---------|---------|-------|-----------|
| 1 | GPT-5 (high) | OpenAI | Closed-source | 69 | 82% |
| 2 | GPT-5 (medium) | OpenAI | Closed-source | 68 | 80% |
| 3 | Grok 4 | xAI | Closed-source | 68 | 78% |
| 4 | o3-pro | OpenAI | Closed-source | 68 | 79% |
| 5 | o3 | OpenAI | Closed-source | 67 | 77% |
| 6 | o4-mini (high) | OpenAI | Closed-source | 65 | 74% |
| 7 | Gemini 2.5 Pro | Google | Closed-source | 65 | 75% |
| 8 | GPT-5 mini | OpenAI | Closed-source | 64 | 71% |
| 9 | Claude 4.1 Opus | Anthropic | Closed-source | 61 | 68% |
| 10 | Claude 4 Sonnet | Anthropic | Closed-source | 59 | 65% |
| 11 | Llama 3.1 405B | Meta | Open-source | 58 | 62% |
| 12 | Mistral Large 2 | Mistral | Open-source | 57 | 60% |
| 13 | GPT-4o | OpenAI | Closed-source | 56 | 58% |
| 14 | Claude 3.5 Sonnet | Anthropic | Closed-source | 55 | 57% |
| 15 | Gemini 1.5 Pro | Google | Closed-source | 54 | 56% |
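The HumanEval percentages in the table are pass rates. When evaluators draw several samples per problem, the standard way to report them is the unbiased pass@k estimator introduced alongside HumanEval: with n samples per problem of which c pass the tests, pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem,
    c of them correct. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 correct -> pass@1 = 0.3
print(pass_at_k(10, 3, 1))
```

A benchmark's headline number is then the mean of this estimate over all problems.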


About Coding Benchmarks

HumanEval

Python programming problems with test cases. Each task supplies a function signature and docstring; a model's completion counts as correct only if it passes the task's unit tests, and scores are reported as pass rates.
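To make the task format concrete, here is an illustrative HumanEval-style problem (invented for this sketch, not taken from the actual benchmark): the prompt is a signature plus docstring, the model supplies the body, and grading simply runs the completion against the unit tests.

```python
# Prompt as the model would see it: signature + docstring only.
PROMPT = '''
def running_max(nums):
    """Return a list where element i is the maximum of nums[:i+1]."""
'''

# A candidate completion, as a model might generate it.
def running_max(nums):
    result = []
    current = float("-inf")
    for n in nums:
        current = max(current, n)
        result.append(current)
    return result

def check(candidate):
    """Execution-based grading: the sample passes only if
    every assertion holds; any failure marks it incorrect."""
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]

check(running_max)  # raises AssertionError for a failing completion
```

Because correctness is decided by executing tests rather than by string matching, superficially different but functionally equivalent completions all score the same.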