
SWE-Rebench

A continuously updated software engineering benchmark from Nebius that uses fresh GitHub issues to avoid contamination. Each model is evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.

Top models on SWE-Rebench — April 10, 2026

As of April 10, 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and DeepSeek V3.2 (60.9%).

6 models · Coding · 31% of category score · Current · Updated April 10, 2026

According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and DeepSeek V3.2 (60.9%). The spread is moderate: roughly 7 points separate the top score (65.3%) from the lowest (58.2%), with a clear gap between the top tier and the mid-tier models.

6 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 31% of the category score, so strong performance here directly affects a model's overall ranking.
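To make that weighting concrete, here is a minimal sketch of how a SWE-Rebench score could feed into an overall ranking, assuming simple multiplicative weights (BenchLM.ai's actual aggregation may differ):

```python
# Sketch of how SWE-Rebench feeds into an overall score, assuming simple
# multiplicative weights. BenchLM.ai's real aggregation may differ.
CODING_CATEGORY_WEIGHT = 0.20   # Coding category's share of the overall score
SWE_REBENCH_WEIGHT = 0.31       # SWE-Rebench's share of the Coding category

def overall_contribution(resolved_rate: float) -> float:
    """Return the points a SWE-Rebench score adds to a model's overall score."""
    return resolved_rate * SWE_REBENCH_WEIGHT * CODING_CATEGORY_WEIGHT

# SWE-Rebench as a whole accounts for 0.31 * 0.20 = 6.2% of the overall score,
# so Claude Opus 4.6's 65.3% resolved rate contributes about 4.05 points.
print(overall_contribution(65.3))  # -> 4.0486...
```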

About SWE-Rebench

Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering

SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problems date from 2023, scores here reflect consistent, up-to-date difficulty.
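One way to read the reported metric is sketched below, under the assumption that each of the 5 runs yields its own pass@1 (the fraction of problems that run resolves) and the best run's figure is reported; the data shapes and names are illustrative, not SWE-Rebench's actual harness:

```python
# Minimal sketch of "Resolved Rate (best pass@1)" as described above:
# 5 independent runs per problem, report the best single run's resolve
# fraction. Data shapes here are illustrative assumptions.
from typing import Dict, List

NUM_RUNS = 5

def resolved_rate(runs: Dict[str, List[bool]]) -> float:
    """runs maps problem_id -> NUM_RUNS booleans (one outcome per run).

    Computes each run's pass@1 (fraction of problems it resolved) and
    returns the best of the runs.
    """
    per_run = [
        sum(outcomes[i] for outcomes in runs.values()) / len(runs)
        for i in range(NUM_RUNS)
    ]
    return max(per_run)

# Toy example: 2 problems, 5 runs each.
runs = {
    "repo/issue-101": [True, False, True, True, False],
    "repo/issue-202": [False, False, True, False, False],
}
print(resolved_rate(runs))  # run 3 resolved both problems -> 1.0
```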

BenchLM freshness & provenance

Version: Rolling 2026 window
Refresh cadence: Rolling
Staleness state: Current
Question availability: Rolling public issues

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
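As a rough illustration, that tiering could be modeled as a lookup from staleness state to treatment. The tier names come from the paragraph above; the states and mapping logic are assumptions, not BenchLM's actual policy:

```python
# Rough sketch of freshness-based tiering, using the three tiers named
# above. The states and mapping are illustrative, not BenchLM's policy.
def benchmark_tier(staleness_state: str) -> str:
    tiers = {
        "current": "strong differentiator",  # fresh, rolling question pool
        "aging": "benchmark to watch",       # hypothetical intermediate state
        "stale": "display-only reference",   # likely contaminated or saturated
    }
    return tiers.get(staleness_state.lower(), "display-only reference")

print(benchmark_tier("Current"))  # -> "strong differentiator"
```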

Leaderboard (6 models)

1. Claude Opus 4.6 · 65.3%
2. GLM-5 · 62.8%
3. DeepSeek V3.2 · 60.9%
4. 60.7%
5. 58.5%
6. 58.2%

FAQ

What does SWE-Rebench measure?

SWE-Rebench measures real-world software engineering ability: resolving fresh GitHub issues by generating code patches. The benchmark is continuously updated by Nebius to avoid contamination; each model is evaluated 5 times per problem under a fixed ReAct scaffold, and the Resolved Rate (best pass@1) is reported.

Which model scores highest on SWE-Rebench?

Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.

How many models are evaluated on SWE-Rebench?

6 AI models have been evaluated on SWE-Rebench on BenchLM.

Last updated: April 10, 2026 · BenchLM version: Rolling 2026 window
