
Scientific Code Benchmark (SciCode)

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis, and are based on real scripts from published research.
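Each subproblem is ultimately scored by running model-generated code against reference checks. A minimal sketch of what that kind of harness looks like — the example task, function names, and tolerance here are hypothetical illustrations, not SciCode's actual evaluation code:

```python
# Minimal sketch of subproblem-level scoring: a generated function
# passes only if it reproduces the reference output on every test case.
# The sinc-function task and tolerance are hypothetical, not from SciCode.
import math

def generated_solution(x):
    # Stand-in for model-generated code, e.g. a small physics helper.
    return math.sin(x) / x if x != 0 else 1.0

def score_subproblem(candidate, test_cases, tol=1e-8):
    """Return the fraction of reference test cases the candidate passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if abs(candidate(*args) - expected) <= tol:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures, not crashes
    return passed / len(test_cases)

tests = [((0.0,), 1.0), ((math.pi,), math.sin(math.pi) / math.pi)]
print(score_subproblem(generated_solution, tests))  # 1.0
```

A full problem passes only when all of its subproblems pass, which is why headline scores stay low even for strong models.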

Top models on SciCode — April 10, 2026

As of April 10, 2026, Kimi K2.5 leads the SciCode leaderboard with 48.7%.

1 model · Coding (10% of category score) · Refreshing · Updated April 10, 2026

About SciCode

Year: 2024
Tasks: 80

BenchLM freshness & provenance

Version: SciCode 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
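That tiering decision can be pictured as a simple lookup from a benchmark's staleness state to how its scores are weighted. The tier names below come from the text, but the state names and mapping are assumptions for illustration, not BenchLM's actual schema or policy:

```python
# Hypothetical sketch: map a benchmark's staleness state to the tier
# named in the text above. State names and the exact mapping are
# assumptions, not BenchLM's real metadata schema.
def benchmark_tier(staleness_state: str) -> str:
    tiers = {
        "fresh": "strong differentiator",
        "refreshing": "strong differentiator",
        "aging": "benchmark to watch",
        "stale": "display-only reference",
    }
    # Unknown states fall back to the most conservative treatment.
    return tiers.get(staleness_state.lower(), "display-only reference")

print(benchmark_tier("Refreshing"))  # strong differentiator
```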

Leaderboard (1 model)

1. Kimi K2.5 (Moonshot AI) — 48.7%

FAQ

What does SciCode measure?

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis, and are based on real scripts from published research.

Which model scores highest on SciCode?

Kimi K2.5 by Moonshot AI currently leads with a score of 48.7% on SciCode.

How many models are evaluated on SciCode?

1 AI model has been evaluated on SciCode on BenchLM.

Last updated: April 10, 2026 · BenchLM version SciCode 2024

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.