A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.
BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard from its May 11, 2026 snapshot. The source reports 11 agent/model rows with confidence intervals and harness labels such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.
SWE-Atlas Refactoring is display-only on BenchLM. It is useful evidence about software-engineering agents, but its rows mix base-model quality with agent-harness choices, so BenchLM keeps it out of weighted model-only rankings.
BenchLM mirrors the published refactoring score view for SWE-Atlas Refactoring. Claude Opus 4.7 (Adaptive) leads the public snapshot at 48.6%, followed by GPT-5.5 (44.8%) and GPT-5.4 (44.3%). BenchLM does not use these results to rank models overall.
Model | Organization | Agent harness | Refactoring score
Claude Opus 4.7 (Adaptive) | Anthropic | Opus-4.7 (Claude Code) | 48.6%
GPT-5.5 | OpenAI | GPT-5.5 (Codex) | 44.8%
GPT-5.4 | OpenAI | GPT-5.4 (Codex) | 44.3%
The published SWE-Atlas Refactoring snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 48.6%, and the third-place row trails by only 4.3 points. The broader top-10 spread is 29.1 points, so the benchmark still separates strong models from weaker ones even when the leaders cluster.
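As a quick check on that arithmetic, the sketch below recomputes the leader-to-third gap from the three published scores. It is a minimal illustration; only the three score values come from the mirrored snapshot.

```python
# Recompute the leader-to-third gap quoted above. Only these three
# score values come from the mirrored snapshot.
top_scores = [
    ("Claude Opus 4.7 (Adaptive)", 48.6),
    ("GPT-5.5", 44.8),
    ("GPT-5.4", 44.3),
]

leader_name, leader_score = top_scores[0]
third_name, third_score = top_scores[-1]

# 48.6 - 44.3 = 4.3 points between first and third place.
gap = leader_score - third_score
print(f"{leader_name} leads {third_name} by {gap:.1f} points")
```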
Eleven models have been evaluated on SWE-Atlas Refactoring. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. SWE-Atlas Refactoring itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
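To make the exclusion concrete, here is a hypothetical sketch of how a display-only flag could keep a benchmark out of a weighted category score. The function, field names, and the second benchmark row are illustrative assumptions, not BenchLM's actual scoring code; only the 22% Agentic weight and the display-only status come from this page.

```python
# Hypothetical sketch: display-only benchmarks are filtered out
# before a weighted category average is computed.
CATEGORY_WEIGHTS = {"Agentic": 0.22}  # 22% weight, from the text above

benchmarks = [
    {"name": "SWE-Atlas Refactoring", "category": "Agentic",
     "score": 48.6, "display_only": True},   # shown, but never scored
    {"name": "HypotheticalAgenticBench", "category": "Agentic",
     "score": 61.0, "display_only": False},  # illustrative placeholder
]

def category_score(rows, category):
    # Display-only rows are dropped here, so they cannot influence
    # the weighted category score.
    scored = [r["score"] for r in rows
              if r["category"] == category and not r["display_only"]]
    return sum(scored) / len(scored) if scored else None

agentic = category_score(benchmarks, "Agentic")
contribution = agentic * CATEGORY_WEIGHTS["Agentic"] if agentic is not None else 0.0
print(f"Agentic contribution to the overall score: {contribution:.2f}")
```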
Year: 2026
Tasks: SWE-Atlas refactoring tasks
Format: Refactoring score with confidence intervals
Difficulty: Real-world software-engineering agent tasks
BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as a display-only agentic software-engineering benchmark. The source compares model-agent combinations such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.
Version: SWE-Atlas Refactoring 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
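As a rough illustration of how such a policy could work, the sketch below maps a snapshot date and refresh cadence to one of the three tiers named above. The thresholds and classification logic are assumptions, not BenchLM's published methodology; only the May 11, 2026 snapshot date, the quarterly cadence, and the "Current" state come from this page.

```python
from datetime import date

# Hypothetical freshness check. Thresholds are assumptions; the tier
# names mirror the three treatments described in the text above.
CADENCE_DAYS = {"Quarterly": 92, "Monthly": 31, "Annual": 366}

def freshness_tier(snapshot: date, cadence: str, today: date) -> str:
    age_days = (today - snapshot).days
    window = CADENCE_DAYS[cadence]
    if age_days <= window:
        return "strong differentiator"   # within one refresh window
    if age_days <= 2 * window:
        return "benchmark to watch"      # missed one refresh
    return "display-only reference"      # stale; shown but not scored

# The mirrored snapshot is from May 11, 2026 with a quarterly cadence,
# so shortly after capture it classifies as fresh. (SWE-Atlas
# Refactoring is display-only for a different reason: its rows mix
# model and harness choices, independent of freshness.)
print(freshness_tier(date(2026, 5, 11), "Quarterly", date(2026, 6, 1)))
```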
Claude Opus 4.7 (Adaptive) currently leads the published SWE-Atlas Refactoring snapshot with a refactoring score of 48.6%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
11 AI models are included in BenchLM's mirrored SWE-Atlas Refactoring snapshot, based on the public leaderboard snapshot captured on May 11, 2026.