
SWE-Atlas Refactoring

A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.

How BenchLM shows SWE-Atlas Refactoring

BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as of its May 11, 2026 snapshot. The source reports 11 agent/model rows with confidence intervals and harness labels such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.

SWE-Atlas Refactoring is display only on BenchLM. It is useful evidence about software-engineering agents, but the rows mix base model quality with agent harness choices, so BenchLM keeps it out of weighted model-only rankings.

11 model variants · SWE-Atlas task family · Refactoring score · Scale Labs source · Display only

Refactoring score on SWE-Atlas Refactoring — May 11, 2026 snapshot

BenchLM mirrors the published refactoring score view for SWE-Atlas Refactoring. Claude Opus 4.7 (Adaptive) leads the public snapshot at 48.6%, followed by GPT-5.5 (44.8%) and GPT-5.4 (44.3%). BenchLM does not use these results to rank models overall.

11 models · Agentic · Current · Display only · Updated May 11, 2026

The published SWE-Atlas Refactoring snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 48.6%, while the third row is only 4.3 points behind. The broader top-10 spread is 29.1 points, so the benchmark still separates strong models even when the leaders cluster.
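
For readers who want to check the arithmetic, here is a minimal Python sketch recomputing both spread figures from the published scores (values copied from the score table below):

```python
# Minimal sketch: recompute the two spread figures quoted above from the
# published May 11, 2026 snapshot scores.
leader = 48.6   # Claude Opus 4.7 (Adaptive), rank 1
third = 44.3    # GPT-5.4, rank 3
tenth = 19.5    # MiniMax M2.5, rank 10

print(f"Gap from rank 1 to rank 3: {leader - third:.1f} points")   # 4.3
print(f"Top-10 spread (rank 1-10): {leader - tenth:.1f} points")   # 29.1
```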

11 models have been evaluated on SWE-Atlas Refactoring. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. SWE-Atlas Refactoring is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
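
BenchLM's actual scoring code is not shown on this page, so the following is only a hypothetical sketch of the policy described above: a display-only flag gates a benchmark out of the category average before the 22% category weight is applied. The second benchmark row is made up for illustration.

```python
# Hypothetical sketch, not BenchLM's actual implementation: one way a
# display-only flag could exclude a benchmark from a weighted category score.
CATEGORY_WEIGHTS = {"Agentic": 0.22}  # Agentic carries 22% per this page

benchmarks = [
    # (name, category, score, display_only)
    ("SWE-Atlas Refactoring", "Agentic", 48.6, True),
    ("HypotheticalAgenticBench", "Agentic", 61.0, False),  # made-up row
]

def category_score(category):
    """Average only the scored (non-display-only) benchmarks in a category."""
    scored = [score for _, cat, score, display_only in benchmarks
              if cat == category and not display_only]
    return sum(scored) / len(scored) if scored else None

# SWE-Atlas Refactoring is skipped, so only the made-up row contributes.
agentic = category_score("Agentic")
print(agentic)                                # 61.0
print(agentic * CATEGORY_WEIGHTS["Agentic"])  # its 22% slice of the overall score
```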

About SWE-Atlas Refactoring

Year: 2026
Tasks: SWE-Atlas refactoring tasks
Format: Refactoring score with confidence intervals
Difficulty: Real-world software-engineering agent tasks

BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as a display-only agentic software-engineering benchmark. The source compares model-agent combinations built with harnesses such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.

BenchLM freshness & provenance

Version: SWE-Atlas Refactoring 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
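
As a rough illustration only, with the treatment labels taken from the sentence above but the decision logic assumed rather than drawn from the methodology page, the freshness gate might look like this:

```python
# Hypothetical sketch (logic assumed, not from BenchLM's methodology page):
# mapping freshness metadata to one of the three treatments named above.
def benchmark_treatment(staleness_state, display_only):
    if display_only:
        return "display-only reference"
    if staleness_state == "Current":
        return "strong differentiator"
    return "benchmark to watch"

# SWE-Atlas Refactoring is "Current" but flagged display-only, so it remains
# a reference row rather than a ranking input.
print(benchmark_treatment("Current", display_only=True))
```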

Refactoring score table (11 models)

1. Claude Opus 4.7 (Adaptive) · Opus-4.7 (Claude Code) · 48.6%
2. GPT-5.5 · Gpt-5.5 (Codex) · 44.8%
3. GPT-5.4 · Gpt-5.4 (Codex) · 44.3%
4. GPT-5.3 Codex · Gpt-5.3 (Codex) · 42.4%
5. Claude Opus 4.6 · Opus-4.6 (Claude Code) · 35.6%
6. Gemini 3.1 Pro · Gemini-3.1-Pro (Gemini CLI) · 33.8%
7. Claude Sonnet 4.6 · Sonnet-4.6 (Claude Code) · 32.2%
8. GLM-5 · Glm-5 (Mini-SWE-Agent) · 24.2%
9. Kimi K2.5 · Kimi-K2.5 (Mini-SWE-Agent) · 20.9%
10. MiniMax M2.5 · Minimax-M2.5 (Mini-SWE-Agent) · 19.5%
11. Gemini 3 Flash · Gemini-3-Flash (Mini-SWE-Agent) · 10.0%

FAQ

What does SWE-Atlas Refactoring measure?

SWE-Atlas Refactoring is a Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks; results are reported as a refactoring score with confidence intervals.

Which model leads the published SWE-Atlas Refactoring snapshot?

Claude Opus 4.7 (Adaptive) currently leads the published SWE-Atlas Refactoring snapshot with a refactoring score of 48.6%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on SWE-Atlas Refactoring?

11 AI models are included in BenchLM's mirrored SWE-Atlas Refactoring snapshot, based on the public leaderboard captured on May 11, 2026.

Last updated: May 11, 2026 · mirrored from the public benchmark leaderboard
