A continuously updated software engineering benchmark by Nebius that uses fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.
As of May 7, 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and GLM-5.1 (62.7%).
Claude Opus 4.6
Anthropic
GLM-5
Z.AI
GLM-5.1
Z.AI
According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and GLM-5.1 (62.7%). The top models are clustered within 2.6 points, suggesting this benchmark is nearing saturation for frontier models.
13 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 31% of the category score, so strong performance here directly affects a model's overall ranking.
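The weighting described above implies a simple effective contribution for this benchmark to a model's overall score. A minimal sketch of that arithmetic (the 20% and 31% figures are from the text; the multiplication is an illustration, not BenchLM's published formula):

```python
category_weight = 0.20   # Coding category's weight in the overall score
within_category = 0.31   # SWE-Rebench's share of the Coding category score

# Effective share of the overall score driven by SWE-Rebench alone
effective_weight = category_weight * within_category
print(f"{effective_weight:.1%}")  # 6.2%
```

Under this reading, SWE-Rebench alone accounts for roughly 6.2% of a model's overall BenchLM.ai score.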
Year
2026
Tasks
Fresh GitHub issues (rolling window)
Format
Code patch generation
Difficulty
Professional software engineering
SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problem set is frozen in 2023, SWE-Rebench scores reflect consistent, up-to-date difficulty.
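The "best pass@1" metric can be sketched as follows, under one plausible reading: each of the 5 independent runs yields its own resolved fraction over the problem set, and the best run's rate is reported. This is an illustrative assumption, not SWE-Rebench's published scoring code:

```python
def resolved_rate(run_results):
    """Fraction of problems resolved in a single run
    (True = the generated patch resolved the issue)."""
    return sum(run_results) / len(run_results)

def best_pass_at_1(runs):
    """Best resolved rate across independent runs — an assumed
    reading of SWE-Rebench's 'Resolved Rate (best pass@1)'."""
    return max(resolved_rate(r) for r in runs)

# Toy example: 3 problems, 5 runs per problem set
runs = [
    [True, False, True],
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
print(best_pass_at_1(runs))  # 1.0 — the fourth run resolved all 3
```

Taking the best of 5 runs rewards a model for resolving an issue in at least one attempt, which dampens run-to-run variance in agentic scaffolds.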
Version
Rolling 2026 window
Refresh cadence
Rolling
Staleness state
Current
Question availability
Rolling public issues
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.
13 AI models have been evaluated on SWE-Rebench on BenchLM.