LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (LiveCodeBench)

A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.

About LiveCodeBench

Year: 2024
Tasks: Continuously updated
Format: Competitive programming
Difficulty: Competitive programming level

LiveCodeBench addresses data contamination concerns by continuously sourcing new problems from competitive programming platforms. It evaluates code generation, self-repair, code execution, and test output prediction.
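One way to picture the contamination control described above is as a date filter: a model is scored only on problems published after its training cutoff. The following is a minimal sketch under that assumption. The `Problem` record, its field names, the sample problems, and the cutoff date are all hypothetical and are not taken from the LiveCodeBench codebase.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the field names are illustrative only.
    title: str
    platform: str
    released: date


def problems_after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up sample data, one problem per source platform.
problems = [
    Problem("two-pointer-sum", "LeetCode", date(2023, 5, 14)),
    Problem("grid-paths", "AtCoder", date(2024, 2, 3)),
    Problem("tree-queries", "Codeforces", date(2024, 6, 21)),
]

# Assumed cutoff for an imaginary model.
fresh = problems_after_cutoff(problems, training_cutoff=date(2023, 12, 31))
print([p.title for p in fresh])  # ['grid-paths', 'tree-queries']
```

Under this sketch, a model with a later cutoff is scored on a smaller and newer subset, which is why a benchmark of this kind needs a steady supply of new problems.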

Leaderboard (88 models)

#1 GPT-5.3 Codex: 85
#2 GPT-5.2: 79
#3 GPT-5.4: 75
#4 Claude Opus 4.6: 75
#5 Grok 4.1: 73
#6 Gemini 3.1 Pro: 71
#8 GPT-5.2-Codex: 66
#9 GPT-5 (high): 62
#10 GPT-5.1: 61
#11 o1-preview: 60
#12 GPT-5 (medium): 60
#15 GLM-5 (Reasoning): 58
#16 Kimi K2.5 (Reasoning): 58
#17 Claude Opus 4.5: 57
#18 Claude Sonnet 4.6: 54
#20 Claude Sonnet 4.5: 53
#21 Gemini 3 Pro: 49
#22 MiMo-V2-Flash: 49
#24 DeepSeek Coder 2.0: 45
#25 GLM-4.7-Flash: 45
#26 o3-pro: 44
#27 DeepSeekMath V2: 44
#28 GLM-4.7: 43
#30 o3: 40
#31 Qwen2.5-1M: 40
#32 Qwen2.5-72B: 40
#33 Claude 4.1 Opus: 40
#34 DeepSeek V3.2: 39
#35 Qwen3.5 397B: 39
#36 DeepSeek LLM 2.0: 39
#37 Mistral Large 3: 39
#38 Claude 3.5 Sonnet: 39
#39 Mistral Large 2: 38
#41 GPT-4o: 38
#42 GPT-5 mini: 37
#43 Grok 4: 37
#44 Gemini 2.5 Pro: 37
#45 Kimi K2.5: 37
#47 Claude 4 Sonnet: 36
#48 Gemini 3 Flash: 36
#49 Claude Haiku 4.5: 36
#50 GLM-5: 35
#51 MiniMax M2.5: 35
#52 o4-mini (high): 34
#54 GPT-OSS 120B: 25
#56 Mistral 8x7B: 23
#57 GPT-4 Turbo: 23
#58 Gemini 1.5 Pro: 22
#59 Nemotron-4 15B: 22
#60 Z-1: 22
#62 Moonshot v1: 21
#63 Claude 3 Opus: 20
#64 Claude 3 Haiku: 20
#65 Llama 3 70B: 19
#66 DeepSeek-R1: 19
#67 Gemini 2.5 Flash: 18
#69 Gemini 1.0 Pro: 16
#73 Gemma 3 27B: 15
#75 GLM-4.5-Air: 15
#76 DeepSeek V3.1: 15
#77 Nova Pro: 14
#78 Mistral 7B v0.3: 14
#80 GLM-4.5: 13
#81 Kimi K2: 12
#82 Mistral 8x7B v0.2: 12
#83 Llama 4 Scout: 11
#84 Qwen2.5-VL-32B: 11
#86 GPT-OSS 20B: 11
#87 Qwen3 235B 2507: 10
#88 MiniMax M1 80k: 10

FAQ

What does LiveCodeBench measure?

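LiveCodeBench measures how well models generate code for fresh competitive programming problems from LeetCode, Codeforces, and AtCoder. It is updated continuously so that the problems are unlikely to appear in a model's training data, and it also covers self-repair, code execution, and test output prediction.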

Which model scores highest on LiveCodeBench?

GPT-5.3 Codex by OpenAI currently leads with a score of 85 on LiveCodeBench.

How many models are evaluated on LiveCodeBench?

BenchLM lists 88 AI models evaluated on LiveCodeBench.