A continuously updated software engineering benchmark by Nebius that uses fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.
As of April 10, 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and DeepSeek V3.2 (60.9%).
1. Claude Opus 4.6 (Anthropic) — 65.3%
2. GLM-5 (Z.AI) — 62.8%
3. DeepSeek V3.2 (DeepSeek) — 60.9%
According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and DeepSeek V3.2 (60.9%). The scores show moderate spread, with meaningful differences between the top tier and mid-tier models.
Six models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 31% of the category score, so strong performance here directly affects a model's overall ranking.
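The two percentages above imply a single effective weight for SWE-Rebench on the overall score. A minimal sketch, assuming the aggregation is a simple product of nested weights (the exact BenchLM.ai formula is not stated in the text; only the 20% and 31% figures come from it):

```python
# Hypothetical illustration: only the two percentages come from the page;
# the product-of-weights aggregation is an assumption.

CODING_CATEGORY_WEIGHT = 0.20  # Coding category's share of the overall score
SWE_REBENCH_SHARE = 0.31       # SWE-Rebench's share within the Coding category

def effective_weight(category_weight: float, within_category_share: float) -> float:
    """Effective contribution of one benchmark to the overall score,
    assuming nested weights multiply."""
    return category_weight * within_category_share

weight = effective_weight(CODING_CATEGORY_WEIGHT, SWE_REBENCH_SHARE)
print(f"SWE-Rebench effective weight: {weight:.3f}")  # 0.20 * 0.31 = 0.062
```

Under that assumption, SWE-Rebench alone would account for roughly 6.2% of a model's overall BenchLM.ai score.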
Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering
SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problems date to 2023, SWE-Rebench scores reflect consistent, up-to-date difficulty.
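The "best pass@1" Resolved Rate described above can be sketched as follows: each of the 5 runs produces a per-run resolved fraction over the problem set, and the best run's fraction is reported. The data layout here is hypothetical; only the 5-run, best-of protocol comes from the text:

```python
# Sketch of the best-pass@1 Resolved Rate: run_results[i][j] records
# whether run i resolved problem j. Data below is made up for illustration.
from typing import List

def resolved_rate(run_results: List[List[bool]]) -> float:
    """Return the best per-run resolved fraction across all runs."""
    per_run = [sum(run) / len(run) for run in run_results]
    return max(per_run)

# Example: 5 independent runs over 4 problems
runs = [
    [True, False, True, False],   # run 1: 50%
    [True, True, True, False],    # run 2: 75%  <- best run
    [False, False, True, False],  # run 3: 25%
    [True, False, True, True],    # run 4: 75%
    [True, False, False, False],  # run 5: 25%
]
print(resolved_rate(runs))  # 0.75
```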
Version: Rolling 2026 window
Refresh cadence: Rolling
Staleness state: Current
Question availability: Rolling public issues
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.
6 AI models have been evaluated on SWE-Rebench on BenchLM.