
SWE-Rebench

A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported.

Top models on SWE-Rebench — May 7, 2026

As of May 7, 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and GLM-5.1 (62.7%).

13 models · Coding · 31% of category score · Current · Updated May 7, 2026

According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and GLM-5.1 (62.7%). The top models are clustered within 2.6 points, suggesting this benchmark is nearing saturation for frontier models.

13 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 31% of the category score, so strong performance here directly affects a model's overall ranking.
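The weighting described above can be sketched as a short calculation. This is a hypothetical illustration using only the two weights stated on this page (31% of the Coding category, 20% category weight overall); the constant and function names are ours, not BenchLM's.

```python
# Hypothetical sketch: how a SWE-Rebench score could feed into an
# overall BenchLM ranking, assuming the stated weights are applied
# multiplicatively. Names here are illustrative, not BenchLM's API.
SWE_REBENCH_WEIGHT_IN_CODING = 0.31  # 31% of the Coding category score
CODING_WEIGHT_OVERALL = 0.20         # Coding is 20% of the overall score

def overall_contribution(swe_rebench_score: float) -> float:
    """Points a model's SWE-Rebench score contributes to its overall score."""
    return swe_rebench_score * SWE_REBENCH_WEIGHT_IN_CODING * CODING_WEIGHT_OVERALL

# A 65.3% resolved rate would contribute 65.3 * 0.31 * 0.20 ≈ 4.05 points.
print(round(overall_contribution(65.3), 2))
```

Under this reading, SWE-Rebench alone accounts for roughly 6.2% (0.31 × 0.20) of a model's total score.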

About SWE-Rebench

Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering

SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problems date from 2023, scores reflect a consistent, up-to-date level of difficulty.
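The "Resolved Rate (best pass@1)" metric can be sketched as follows. This is a minimal illustration assuming each of the 5 scaffold runs is scored independently (pass@1 for a run is the fraction of problems it resolves) and the best run's rate is reported; the exact aggregation SWE-Rebench uses may differ.

```python
# Minimal sketch of a "best pass@1" aggregation, assuming each run is
# scored independently and the best per-run resolve fraction is kept.
# The function name and data layout are illustrative.

def resolved_rate(results: list[list[bool]]) -> float:
    """results[run][problem] is True if that run's patch resolved the issue.

    Returns the best per-run resolve fraction across all runs.
    """
    return max(sum(run) / len(run) for run in results)

# Three problems, two runs shown for brevity (the benchmark uses five).
runs = [
    [True, False, True],   # run 1 resolves 2 of 3 problems
    [True, True, False],   # run 2 also resolves 2 of 3
]
print(resolved_rate(runs))
```

Reporting the best of several runs rewards a model's peak capability under the fixed scaffold rather than its average consistency.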

BenchLM freshness & provenance

Version: Rolling 2026 window
Refresh cadence: Rolling
Staleness state: Current
Question availability: Rolling public issues

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (13 models)

1. Claude Opus 4.6: 65.3%
2. GLM-5: 62.8%
3. GLM-5.1: 62.7%
4. 60.9%
5. 60.7%
6. 58.9%
7. 58.7%
8. 58.5%
9. 58.2%
10. 58.0%
11. 53.7%
12. 51.9%
13. 41.6%

FAQ

What does SWE-Rebench measure?

SWE-Rebench measures a model's ability to resolve real, recently filed GitHub issues. It is a continuously updated software engineering benchmark by Nebius that uses fresh issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold, and the Resolved Rate (best pass@1) is reported.

Which model scores highest on SWE-Rebench?

Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.

How many models are evaluated on SWE-Rebench?

13 AI models have been evaluated on SWE-Rebench on BenchLM.

Last updated: May 7, 2026 · BenchLM version Rolling 2026 window
