SWE-Rebench

A continuously updated software engineering benchmark by Nebius that uses fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.

Top Models on SWE-Rebench — March 2026

As of March 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%).

14 models · Coding category · 35% of category score · Updated March 18, 2026

According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%). The top three models are clustered within 3.0 points, suggesting this benchmark is nearing saturation for frontier models.

14 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 35% of the category score, so strong performance here directly affects a model's overall ranking.
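The weighting described above can be made concrete with a small sketch. This assumes a simple multiplicative aggregation (benchmark weight within category times category weight overall); the exact BenchLM.ai formula is not documented here, so the function name and aggregation rule are illustrative assumptions.

```python
# Illustrative sketch: a SWE-Rebench score's contribution to an overall
# BenchLM.ai-style ranking, assuming weights simply multiply.
# The aggregation rule here is an assumption, not the site's documented formula.

CODING_CATEGORY_WEIGHT = 0.20  # Coding category's share of the overall score
SWE_REBENCH_WEIGHT = 0.35      # SWE-Rebench's share within the Coding category

def overall_contribution(resolved_rate: float) -> float:
    """Points a SWE-Rebench resolved rate adds to the overall score."""
    return resolved_rate * SWE_REBENCH_WEIGHT * CODING_CATEGORY_WEIGHT

# Under these assumptions SWE-Rebench drives 0.35 * 0.20 = 7% of the
# overall score, so a 65.3% resolved rate contributes about 4.57 points.
print(overall_contribution(65.3))
```

Under these assumed weights, a 3-point gap on SWE-Rebench moves the overall score by roughly 0.2 points.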

About SWE-Rebench

Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering

SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problems date from 2023, scores here reflect consistent, up-to-date difficulty.
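The scoring protocol above can be sketched as follows, assuming "best pass@1" means: run the full problem set 5 times, compute the fraction of problems resolved in each run, and report the best run. That reading of the metric, and all names below, are illustrative assumptions.

```python
# Sketch of the Resolved Rate (best pass@1) metric, under the assumption
# that the best single-run resolution fraction across 5 runs is reported.

def pass_at_1(run_results: list[bool]) -> float:
    """Fraction of problems resolved in one run of the full problem set."""
    return sum(run_results) / len(run_results)

def resolved_rate(runs: list[list[bool]]) -> float:
    """Best pass@1 across all runs (5 runs on SWE-Rebench)."""
    return max(pass_at_1(run) for run in runs)

# Toy example: 4 problems, 3 runs of the same problem set.
runs = [
    [True, False, True, False],   # 50% resolved
    [True, True, True, False],    # 75% resolved
    [True, False, False, False],  # 25% resolved
]
print(resolved_rate(runs))  # 0.75
```

Repeating each problem 5 times smooths out run-to-run variance in agentic scaffolds, where a single unlucky trajectory can fail an otherwise solvable task.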

SWE-Rebench: Contamination-Free Evaluation of Software Engineering Agents

Leaderboard (14 models)

#1 Claude Opus 4.6 — 65.3%
#2 GLM-5 — 62.8%
#3 Gemini 3.1 Pro — 62.3%
#4 DeepSeek V3.2 — 60.9%
#5 Claude Sonnet 4.6 — 60.7%
#6 Claude Sonnet 4.5 — 60.0%
#8 Kimi K2.5 — 58.5%
#9 GPT-5.3 Codex — 58.2%
#10 Kimi K2.5 (Reasoning) — 57.4%
#11 GPT-5.2-Codex — 56.8%
#12 Gemini 3 Flash — 52.5%
#13 GLM-4.5-Air — 38.3%
#14 GPT-OSS 120B — 33.3%

FAQ

What does SWE-Rebench measure?

SWE-Rebench measures a model's ability to resolve real software engineering tasks: fresh GitHub issues collected continuously by Nebius to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.

Which model scores highest on SWE-Rebench?

Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.

How many models are evaluated on SWE-Rebench?

14 AI models have been evaluated on SWE-Rebench on BenchLM.

Last updated: March 18, 2026
