Scientific Code Benchmark (SciCode)

Name: Scientific Code Benchmark
Creator: BenchLM

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and material science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. Based on real scripts from published research.

Top models on SciCode — May 22, 2026

As of May 22, 2026, Qwen3.7 Max leads the SciCode leaderboard with 53.5% , followed by Gemini 3.5 Flash (53.1%) and Kimi K2.6 (52.2%).

1Closed

Qwen3.7 Max

Alibaba

53.5%

Overall 92Context 1M

2Closed

Gemini 3.5 Flash

Google

53.1%

Overall 87Context 1M

3Open

Kimi K2.6

Moonshot AI

52.2%

Overall 85Context 256K

9 modelsCoding10% of category scoreRefreshingUpdated May 22, 2026

According to BenchLM.ai, Qwen3.7 Max leads the SciCode benchmark with a score of 53.5%, followed by Gemini 3.5 Flash (53.1%) and Kimi K2.6 (52.2%). The top models are clustered within 1.3 points, suggesting this benchmark is nearing saturation for frontier models.

9 models have been evaluated on SciCode. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SciCode contributes 10% of the category score, so strong performance here directly affects a model's overall ranking.

About SciCode

Year

2024

Tasks

BenchLM freshness & provenance

Version

SciCode 2024

Refresh cadence

Annual

Staleness state

Refreshing

Question availability

Public benchmark set

Refreshing

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.