A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported.
As of March 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%).
Claude Opus 4.6 (Anthropic)
GLM-5 (Zhipu AI)
Gemini 3.1 Pro
According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%). The top three models are clustered within 3.0 percentage points, suggesting this benchmark is nearing saturation for frontier models.
14 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 35% of the category score, which works out to about 7% of the overall score (0.20 × 0.35), so strong performance here directly affects a model's overall ranking.
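The weighting above can be worked through in a few lines. This is only a sketch: it assumes the overall score is a plain weighted sum of category and benchmark scores, which the page does not spell out, and the constant and function names are illustrative rather than anything BenchLM.ai publishes.

```python
# Weights as stated on the page; names are illustrative only.
CODING_CATEGORY_WEIGHT = 0.20  # Coding share of the overall score
SWE_REBENCH_SHARE = 0.35       # SWE-Rebench share of the Coding category


def overall_contribution(score_pct: float) -> float:
    """Points added to an overall score, ASSUMING a plain weighted sum."""
    return score_pct * CODING_CATEGORY_WEIGHT * SWE_REBENCH_SHARE


# Effective weight of SWE-Rebench in the overall score.
effective_weight = CODING_CATEGORY_WEIGHT * SWE_REBENCH_SHARE

print(f"Effective weight: {effective_weight:.0%}")
print(f"A 65.3% score contributes {overall_contribution(65.3):.2f} points")
```

Under that assumption, the 3.0-point gap between the first and third model would move the overall score by roughly 0.21 points.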
Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering
SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model is run 5 times per problem with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, which is built on problems from 2023, SWE-Rebench scores reflect consistent, up-to-date difficulty.
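The two ideas in this description, filtering tasks by date and scoring repeated runs, can be sketched as follows. This is not the Nebius pipeline: the `Task` type, the function names, and all data are made up for illustration. The page also does not define "best pass@1", so the sketch reports both the mean and the best per-run resolved rate without claiming which one the leaderboard uses.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Task:
    task_id: str
    created: date  # date the GitHub issue was opened (hypothetical field)


def fresh_tasks(tasks: list[Task], model_release: date) -> list[Task]:
    """Keep only tasks created strictly after the model's release date."""
    return [t for t in tasks if t.created > model_release]


def resolved_rates(runs: list[dict[str, bool]], tasks: list[Task]) -> list[float]:
    """Per-run resolved rate: fraction of tasks whose patch resolved the issue."""
    ids = [t.task_id for t in tasks]
    if not ids:
        raise ValueError("no tasks left after filtering")
    # A task missing from a run's results counts as unresolved.
    return [sum(run.get(i, False) for i in ids) / len(ids) for run in runs]


# Made-up example data: four tasks, one of which predates the model.
tasks = [
    Task("a", date(2026, 1, 10)),
    Task("b", date(2026, 2, 3)),
    Task("c", date(2025, 11, 20)),  # older than the release, so filtered out
    Task("d", date(2026, 2, 20)),
]
release = date(2025, 12, 31)  # hypothetical model release date

# Five independent runs; each maps task id -> resolved or not.
runs = [
    {"a": True, "b": False, "c": True, "d": True},
    {"a": True, "b": True, "c": True, "d": True},
    {"a": False, "b": False, "c": True, "d": True},
    {"a": True, "b": False, "c": False, "d": False},
    {"a": True, "b": True, "c": False, "d": False},
]

fresh = fresh_tasks(tasks, release)
rates = resolved_rates(runs, fresh)
mean_rate = mean(rates)
best_rate = max(rates)

print(f"Fresh tasks: {[t.task_id for t in fresh]}")
print(f"Mean resolved rate over {len(runs)} runs: {mean_rate:.1%}")
print(f"Best single-run resolved rate: {best_rate:.1%}")
```

Results for task "c" are ignored for this model because the task predates its release date, which is the contamination guard the description refers to.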
SWE-Rebench: Contamination-Free Evaluation of Software Engineering Agents
Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.
BenchLM lists 14 AI models evaluated on SWE-Rebench.