Scientific Code Benchmark (SciCode)

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. The problems decompose into 338 subproblems that require domain knowledge recall, scientific reasoning, and precise code synthesis. They are based on real scripts from published research.

Top models on SciCode — May 22, 2026

As of May 22, 2026, Qwen3.7 Max leads the SciCode leaderboard with 53.5%, followed by Gemini 3.5 Flash (53.1%) and Kimi K2.6 (52.2%).

9 models · Coding · 10% of category score · Refreshing · Updated May 22, 2026

According to BenchLM.ai, Qwen3.7 Max leads the SciCode benchmark with a score of 53.5%, followed by Gemini 3.5 Flash (53.1%) and Kimi K2.6 (52.2%). The top three models are clustered within 1.3 points, suggesting this benchmark is nearing saturation for frontier models.
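The 1.3-point figure is the gap between the first and third scores quoted above. A minimal sketch of that arithmetic follows; the dictionary and variable names are illustrative and not part of any BenchLM API.

```python
# Top-three SciCode scores as quoted in the text, in percent.
top_three = {
    "Qwen3.7 Max": 53.5,
    "Gemini 3.5 Flash": 53.1,
    "Kimi K2.6": 52.2,
}

# Spread between the best and worst of the three, rounded to one decimal
# place to avoid floating-point noise.
spread = round(max(top_three.values()) - min(top_three.values()), 1)
leader = max(top_three, key=top_three.get)

print(f"{leader} leads; top-three spread is {spread} points")
```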

Nine models have been evaluated on SciCode. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SciCode contributes 10% of the category score, so performance here feeds into a model's overall ranking.
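The two weights can be combined into a rough estimate of how much SciCode moves an overall score. The sketch below assumes the overall score is a plain weighted sum, so that the category weight multiplies the within-category share. That aggregation rule is an assumption, not something stated on this page, and the function names are hypothetical.

```python
# Weights quoted in the text.
CODING_CATEGORY_WEIGHT = 0.20   # Coding category's share of the overall score
SCICODE_SHARE_OF_CODING = 0.10  # SciCode's share of the Coding category score


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Benchmark's share of the overall score, ASSUMING weights multiply."""
    return category_weight * benchmark_share


def overall_points_from_gap(gap_points: float, weight: float) -> float:
    """Convert a gap in benchmark points into overall-score points."""
    return gap_points * weight


scicode_weight = effective_weight(CODING_CATEGORY_WEIGHT, SCICODE_SHARE_OF_CODING)
print(f"SciCode effective weight: {scicode_weight:.1%}")
```

Under that assumption SciCode accounts for about 2% of the overall score, so a 10-point SciCode gap between two models would translate to roughly 0.2 overall points (`overall_points_from_gap(10, scicode_weight)`).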

About SciCode

Year: 2024

Tasks: 80

BenchLM freshness & provenance

Version: SciCode 2024

Refresh cadence: Annual

Staleness state: Refreshing

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (9 models)

1. Qwen3.7 Max: 53.5%
2. Gemini 3.5 Flash: 53.1%
3. Kimi K2.6: 52.2%
4. 48.7%
5. 47.3%
6. 47%
7. 41.2%
9. 27%

FAQ

What does SciCode measure?

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. The problems decompose into 338 subproblems that require domain knowledge recall, scientific reasoning, and precise code synthesis. They are based on real scripts from published research.

Which model scores highest on SciCode?

Qwen3.7 Max by Alibaba currently leads with a score of 53.5% on SciCode.

How many models are evaluated on SciCode?

Nine AI models have been evaluated on SciCode on BenchLM.

Last updated: May 22, 2026 · BenchLM version SciCode 2024
