Coding Benchmarks
Programming and software development: compare AI models across programming benchmarks, including HumanEval, CodeContests, and more.
Coding Benchmark Results
The table below lists 15 of the 20 tracked models.
| Rank | Model | Creator | License | Score | HumanEval |
|---|---|---|---|---|---|
| 1 | GPT-5 (high) | OpenAI | Closed-source | 69 | 82% |
| 2 | GPT-5 (medium) | OpenAI | Closed-source | 68 | 80% |
| 3 | Grok 4 | xAI | Closed-source | 68 | 78% |
| 4 | o3-pro | OpenAI | Closed-source | 68 | 79% |
| 5 | o3 | OpenAI | Closed-source | 67 | 77% |
| 6 | o4-mini (high) | OpenAI | Closed-source | 65 | 74% |
| 7 | Gemini 2.5 Pro | Google | Closed-source | 65 | 75% |
| 8 | GPT-5 mini | OpenAI | Closed-source | 64 | 71% |
| 9 | Claude 4.1 Opus | Anthropic | Closed-source | 61 | 68% |
| 10 | Claude 4 Sonnet | Anthropic | Closed-source | 59 | 65% |
| 11 | Llama 3.1 405B | Meta | Open-source | 58 | 62% |
| 12 | Mistral Large 2 | Mistral | Open-source | 57 | 60% |
| 13 | GPT-4o | OpenAI | Closed-source | 56 | 58% |
| 14 | Claude 3.5 Sonnet | Anthropic | Closed-source | 55 | 57% |
| 15 | Gemini 1.5 Pro | Google | Closed-source | 54 | 56% |
About Coding Benchmarks
HumanEval
A suite of 164 hand-written Python programming problems introduced by OpenAI. Each problem supplies a function signature and docstring; a model's completion passes only if it satisfies the problem's unit tests, and results are commonly reported as the fraction of problems solved (pass@1).
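To make the test-case scoring concrete, here is a minimal sketch of how a HumanEval-style problem can be checked: the model completes a function stub, and the completion counts as solved only if it passes the problem's unit tests. The `problem` record and `check_candidate` helper below are illustrative assumptions, not the official OpenAI evaluation harness.

```python
# A toy HumanEval-style problem: a prompt (signature + docstring),
# hidden unit tests, and the name of the function under test.
problem = {
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

def check_candidate(problem: dict, completion: str) -> bool:
    """Return True if the model's completion passes all unit tests."""
    # Stitch prompt + completion + tests into one program, then run it.
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: dict = {}
    try:
        # NOTE: the real harness sandboxes execution; never exec
        # untrusted model output like this in production.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False

# A hypothetical model-generated completion for the prompt above.
completion = "    return a + b\n"
print(check_candidate(problem, completion))  # True -> counts toward pass@1
```

Under this scheme, a model's HumanEval percentage is simply the share of problems for which its sampled completion makes `check_candidate` return True.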