Benchmark profile

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (LiveCodeBench)

A continuously updated coding benchmark built from newly collected LeetCode, AtCoder, and Codeforces problems. Fresh problem windows reduce one contamination path, but results still need a release and setup check.

Data verified July 20, 2026

Qwen3.7 Max leads the LiveCodeBench leaderboard on BenchLM's July 2026 update with 91.6%, ahead of Qwen3.7 Plus (89.6%) and GLM-4.7 (84.9%), across 6 tracked models.

How to read this leaderboard

Editorial review by Glevd · 2026-07-15

Use LiveCodeBench as a competitive-programming signal after matching the release, date window, task scenario, pass@k metric, sampling count, temperature, and execution policy. Explicit v5, v6, and Pass@1-COT rows now have separate lanes; remaining generic rows can still use different or incompletely labeled windows.
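The matching step described above can be sketched as a small comparability check. This is a minimal illustration, not BenchLM's or LiveCodeBench's schema: the `EvalSetup` class, its field names, and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class EvalSetup:
    """Settings that should match before two scores are compared.

    Hypothetical structure for illustration only. None means the
    source did not report the setting.
    """
    release: Optional[str] = None            # e.g. "v5" or "v6"
    date_window: Optional[str] = None        # free-form label, format is made up
    scenario: Optional[str] = None           # e.g. "code_generation"
    metric: Optional[str] = None             # e.g. "pass@1"
    samples_per_problem: Optional[int] = None
    temperature: Optional[float] = None
    execution_policy: Optional[str] = None   # timeouts, sandboxing, and so on


def mismatches(a: EvalSetup, b: EvalSetup) -> List[str]:
    """Return the settings that differ or are unreported in either row."""
    problems = []
    for f in fields(EvalSetup):
        left, right = getattr(a, f.name), getattr(b, f.name)
        if left is None or right is None or left != right:
            problems.append(f.name)
    return problems


# Two made-up rows: same metric and sampling, different release,
# and neither reports a date window or an execution policy.
row_a = EvalSetup(release="v6", scenario="code_generation", metric="pass@1",
                  samples_per_problem=10, temperature=0.2)
row_b = EvalSetup(release="v5", scenario="code_generation", metric="pass@1",
                  samples_per_problem=10, temperature=0.2)

print(mismatches(row_a, row_b))
# → ['release', 'date_window', 'execution_policy']
```

Treating an unreported setting as a mismatch is a deliberately conservative choice here: a row with missing metadata supports only a directional read, which is how the generic rows on this page are described.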

Operator receipt: 6 sourced rows are currently displayable on this page; the leading published row is Qwen3.7 Max at 91.6%.

Honest limit: LiveCodeBench does not test work inside an existing repository. Release and date-window choice, the pruned versus original task set, generation settings, timeouts, and known autograding issues can affect a score. This legacy lane is a sourced result ledger, not one controlled BenchLM rerun.

Top models on LiveCodeBench — July 20, 2026

As of July 20, 2026, Qwen3.7 Max leads the LiveCodeBench leaderboard with 91.6%, followed by Qwen3.7 Plus (89.6%) and GLM-4.7 (84.9%).

1. Qwen3.7 Max · Alibaba · Closed · qwen3-7-max · 91.6% · Overall 72.84 · Context 1M

2. Qwen3.7 Plus · Alibaba · Closed · qwen3-7-plus · 89.6% · Overall 67.22 · Context 1M

3. GLM-4.7 · Z.AI · Open weight · glm-4-7 · 84.9% · Overall 61.16 · Context 200K

6 models · Coding · 38% of category score · Current · Updated July 20, 2026

Leaderboard (6 models)

Model · Provider · Access · Score

Qwen3.7 Max · Alibaba · Closed · 91.6%

Qwen3.7 Plus · Alibaba · Closed · 89.6%

GLM-4.7 · Z.AI · Open weight · 84.9%

Qwen3.6-27B · Alibaba · Open weight · 83.9%

Qwen3.6-35B-A3B · Alibaba · Open weight · 80.4%

DeepSeek V3 · DeepSeek · Open weight · 37.6%

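According to BenchLM.ai, Qwen3.7 Max leads the LiveCodeBench benchmark with a score of 91.6%, followed by Qwen3.7 Plus (89.6%) and GLM-4.7 (84.9%). Scores span 54.0 percentage points, from 91.6% down to DeepSeek V3 at 37.6%, but the top five models sit within 11.2 points of one another. The benchmark therefore separates DeepSeek V3 clearly from the rest and differentiates the leading models less sharply.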

Six models have been evaluated on LiveCodeBench, which falls in the Coding category. That category carries a 20% weight in BenchLM.ai's overall scoring system, and LiveCodeBench contributes 38% of the category score, so strong performance here directly affects a model's overall ranking.
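To make the weighting concrete, the two percentages can be multiplied to get LiveCodeBench's share of the overall score. The sketch below assumes the weights combine as a simple linear product, which this page does not confirm; BenchLM's actual aggregation may normalize or calibrate scores first. The function names are made up for the example.

```python
CODING_CATEGORY_WEIGHT = 0.20  # share of the overall score, as stated on this page
LIVECODEBENCH_SHARE = 0.38     # share of the Coding category, as stated on this page


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Weight of one benchmark in the overall score, assuming weights multiply."""
    return category_weight * benchmark_share


def overall_shift(score_gap_points: float, weight: float) -> float:
    """Overall-score change caused by a gap on one benchmark (linear assumption)."""
    return score_gap_points * weight


weight = effective_weight(CODING_CATEGORY_WEIGHT, LIVECODEBENCH_SHARE)
gap = 91.6 - 89.6  # Qwen3.7 Max vs Qwen3.7 Plus on this leaderboard

print(f"effective weight: {weight:.1%}")
print(f"overall shift from a {gap:.1f}-point gap: {overall_shift(gap, weight):.3f}")
```

Under that assumption LiveCodeBench accounts for about 7.6% of the overall score, so the 2.0-point gap between the top two rows would move the overall score by roughly 0.15 points on its own.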

About LiveCodeBench

Year: 2024

Tasks: Continuously updated contest problems

Format: Competitive-programming evaluation

Difficulty: Competitive programming level

The official suite evaluates code generation, code execution, test-output prediction, and self-repair on contest problems added over time. Its releases and selectable date windows let evaluators separate older and newer problem sets instead of treating the benchmark as one fixed test.
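The date-window idea can be illustrated with a short filter that keeps only problems published after a model's training cutoff. This is a sketch under stated assumptions: the `Problem` record, the problem IDs, the dates, and the cutoff are all invented for the example and do not come from the official dataset or its loader.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem record; not the official dataset schema."""
    problem_id: str
    source: str       # "LeetCode", "AtCoder", or "Codeforces"
    released: date    # contest or publication date


def after_cutoff(problems: List[Problem], cutoff: date) -> List[Problem]:
    """Keep problems released strictly after the given training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Made-up problems spanning an older and a newer window.
problems = [
    Problem("p1", "LeetCode", date(2024, 3, 10)),
    Problem("p2", "AtCoder", date(2024, 9, 1)),
    Problem("p3", "Codeforces", date(2025, 1, 20)),
]

cutoff = date(2024, 6, 30)  # hypothetical training cutoff for some model
fresh = after_cutoff(problems, cutoff)

print([p.problem_id for p in fresh])
# → ['p2', 'p3']
```

Scoring a model only on the post-cutoff subset removes problems it could have seen during training, but it also shrinks the test set, so two models with different cutoffs end up evaluated on different windows.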

BenchLM freshness & provenance

Version: Rolling 2026 set

Refresh cadence: Rolling

Staleness state: Current

Question availability: Delayed public release

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does LiveCodeBench measure?

LiveCodeBench evaluates coding on newly collected contest problems from LeetCode, AtCoder, and Codeforces. Its suite covers code generation, execution, test-output prediction, and self-repair, although provider tables commonly report code-generation pass rates. It tests competitive-programming correctness, not work inside an existing software repository.

Are LiveCodeBench scores directly comparable?

Only when the release, date window, scenario, pass@k metric, sampling count, temperature, and execution policy match. The official benchmark can change as new problems arrive, while providers may report different slices. BenchLM preserves sourced published rows, so unmatched settings support a directional read, not a precise ranking.
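To show why the pass@k metric and the sampling count both have to match, here is a sketch of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). Whether a given provider row on this page was computed with this exact estimator is an assumption; the sample counts below are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# One made-up problem: 10 samples, 3 of them correct.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

The same ten samples give about 0.30 at k=1 and about 0.92 at k=5, so a pass@5 row and a pass@1 row are not comparable even when everything else matches. A benchmark-level score is the mean of this per-problem value across the chosen problem set.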

Can LiveCodeBench pick the best coding agent?

No. A high score supports competitive-programming performance under the reported setup. It does not establish codebase navigation, issue interpretation, tool use, patch review, or regression safety. Pair LiveCodeBench with a repository benchmark such as SWE-bench Pro and a trial drawn from your own languages, libraries, and runtime constraints.

Last updated: July 20, 2026 · BenchLM version Rolling 2026 set
