A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.
BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard from its May 11, 2026 snapshot. The source reports 11 agent/model rows with confidence intervals and harness labels such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.
SWE-Atlas Refactoring is display-only on BenchLM. It is useful evidence about software-engineering agents, but its rows mix base-model quality with agent-harness choices, so BenchLM keeps it out of weighted model-only rankings.
BenchLM mirrors the published refactoring score view for SWE-Atlas Refactoring. Claude Opus 4.7 (Adaptive) leads the public snapshot at 48.6%, followed by GPT-5.5 (44.8%) and GPT-5.4 (44.3%). BenchLM does not use these results to rank models overall.
Model | Organization | Agent harness | Refactoring score
Claude Opus 4.7 (Adaptive) | Anthropic | Opus-4.7 (Claude Code) | 48.6%
GPT-5.5 | OpenAI | GPT-5.5 (Codex) | 44.8%
GPT-5.4 | OpenAI | GPT-5.4 (Codex) | 44.3%
The published SWE-Atlas Refactoring snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 48.6%, and the third-place row trails by only 4.3 points. The broader top-10 spread is 29.1 points, so the benchmark still separates strong models from weaker ones even when the leaders cluster.
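As a quick check on that arithmetic, the sketch below recomputes the leader-to-third gap from the three published scores. It is a minimal illustration; only the three score values come from the mirrored snapshot.

```python
# Recompute the leader-to-third gap quoted above. Only these three
# score values come from the mirrored snapshot.
top_scores = [
    ("Claude Opus 4.7 (Adaptive)", 48.6),
    ("GPT-5.5", 44.8),
    ("GPT-5.4", 44.3),
]

leader_name, leader_score = top_scores[0]
third_name, third_score = top_scores[-1]

# 48.6 - 44.3 = 4.3 points between first and third place.
gap = leader_score - third_score
print(f"{leader_name} leads {third_name} by {gap:.1f} points")
```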
Eleven models have been evaluated on SWE-Atlas Refactoring. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. SWE-Atlas Refactoring itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
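To make the exclusion concrete, here is a hypothetical sketch of how a display-only flag could keep a benchmark out of a weighted category score. The function, field names, and the second benchmark row are illustrative assumptions, not BenchLM's actual scoring code; only the 22% Agentic weight and the display-only status come from this page.

```python
# Hypothetical sketch: display-only benchmarks are filtered out
# before a weighted category average is computed.
CATEGORY_WEIGHTS = {"Agentic": 0.22}  # 22% weight, from the text above

benchmarks = [
    {"name": "SWE-Atlas Refactoring", "category": "Agentic",
     "score": 48.6, "display_only": True},   # shown, but never scored
    {"name": "HypotheticalAgenticBench", "category": "Agentic",
     "score": 61.0, "display_only": False},  # illustrative placeholder
]

def category_score(rows, category):
    # Display-only rows are dropped here, so they cannot influence
    # the weighted category score.
    scored = [r["score"] for r in rows
              if r["category"] == category and not r["display_only"]]
    return sum(scored) / len(scored) if scored else None

agentic = category_score(benchmarks, "Agentic")
contribution = agentic * CATEGORY_WEIGHTS["Agentic"] if agentic is not None else 0.0
print(f"Agentic contribution to the overall score: {contribution:.2f}")
```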
Year: 2026
Tasks: SWE-Atlas refactoring tasks
Format: Refactoring score with confidence intervals
Difficulty: Real-world software-engineering agent tasks
BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as a display-only agentic software-engineering benchmark. The source compares model-agent combinations such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.
Version: SWE-Atlas Refactoring 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
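As a rough illustration of how such a policy could work, the sketch below maps a snapshot date and refresh cadence to one of the three tiers named above. The thresholds and classification logic are assumptions, not BenchLM's published methodology; only the May 11, 2026 snapshot date, the quarterly cadence, and the "Current" state come from this page.

```python
from datetime import date

# Hypothetical freshness check. Thresholds are assumptions; the tier
# names mirror the three treatments described in the text above.
CADENCE_DAYS = {"Quarterly": 92, "Monthly": 31, "Annual": 366}

def freshness_tier(snapshot: date, cadence: str, today: date) -> str:
    age_days = (today - snapshot).days
    window = CADENCE_DAYS[cadence]
    if age_days <= window:
        return "strong differentiator"   # within one refresh window
    if age_days <= 2 * window:
        return "benchmark to watch"      # missed one refresh
    return "display-only reference"      # stale; shown but not scored

# The mirrored snapshot is from May 11, 2026 with a quarterly cadence,
# so shortly after capture it classifies as fresh. (SWE-Atlas
# Refactoring is display-only for a different reason: its rows mix
# model and harness choices, independent of freshness.)
print(freshness_tier(date(2026, 5, 11), "Quarterly", date(2026, 6, 1)))
```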
Claude Opus 4.7 (Adaptive) currently leads the published SWE-Atlas Refactoring snapshot with a refactoring score of 48.6%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
11 AI models are included in BenchLM's mirrored SWE-Atlas Refactoring snapshot, based on the public leaderboard snapshot captured on May 11, 2026.