
SWE-Atlas Refactoring

A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.

How BenchLM shows SWE-Atlas Refactoring

BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as of the June 19, 2026 snapshot. The source reports 13 agent/model rows with confidence intervals and harness labels such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.

SWE-Atlas Refactoring is display only on BenchLM. It is useful evidence about software-engineering agents, but the rows mix base model quality with agent harness choices, so BenchLM keeps it out of weighted model-only rankings.

13 model variants · SWE-Atlas task family · Refactoring score · Scale Labs source · Display only

Refactoring score on SWE-Atlas Refactoring — June 19, 2026 snapshot

BenchLM mirrors the published refactoring score view for SWE-Atlas Refactoring. Fable-5 (Claude Code) xHigh leads the public snapshot at 54.8%, followed by Claude Opus 4.7 (Adaptive) (48.6%) and Opus 4.8 (Claude Code) (46.7%). BenchLM does not use these results to rank models overall.

13 models · Agentic · Current · Display only · Updated June 19, 2026 snapshot

The published SWE-Atlas Refactoring snapshot is tightly clustered at the top: Fable-5 (Claude Code) xHigh sits at 54.8%, while the third row is only 8.1 points behind. The broader top-10 spread is 30.5 points, so the benchmark still separates strong models even when the leaders cluster.
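The gap figures above are simple differences between published scores. A minimal Python sketch of that arithmetic follows, using the rounded percentages shown on this page. The helper name point_gap is illustrative and not part of any BenchLM or Scale tooling. Because the inputs are rounded to one decimal place, the top-10 result comes out at 30.6 rather than the quoted 30.5, which presumably reflects unrounded source values.

```python
# Scores (percent) as shown on this page, keyed by leaderboard rank.
# Only the ranks needed for the two gap figures are listed.
SCORES_BY_RANK = {
    1: 54.8,   # Fable-5 (Claude Code) xHigh
    3: 46.7,   # Opus 4.8 (Claude Code)
    10: 24.2,  # GLM-5 (Mini-SWE-Agent)
}


def point_gap(higher: float, lower: float) -> float:
    """Return the difference in percentage points, rounded to one decimal."""
    return round(higher - lower, 1)


gap_to_third = point_gap(SCORES_BY_RANK[1], SCORES_BY_RANK[3])
top10_spread = point_gap(SCORES_BY_RANK[1], SCORES_BY_RANK[10])

print(f"Leader vs. third row: {gap_to_third} points")
print(f"Top-10 spread (rounded inputs): {top10_spread} points")
```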

13 models have been evaluated on SWE-Atlas Refactoring. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. SWE-Atlas Refactoring is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About SWE-Atlas Refactoring

Year: 2026
Tasks: SWE-Atlas refactoring tasks
Format: Refactoring score with confidence intervals
Difficulty: Real-world software-engineering agent tasks

BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as a display-only agentic software-engineering benchmark. The source compares model-agent combinations such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent.

BenchLM freshness & provenance

Version: SWE-Atlas Refactoring 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Refactoring score table (13 models)

Rank | Model | Leaderboard label | Refactoring score
2 | Claude Opus 4.7 (Adaptive) | Opus-4.7 (Claude Code) | 48.6%
4 | GPT-5.5 | GPT-5.5 (Codex) xHigh | 44.8%
5 | GPT-5.4 | GPT 5.4 (Codex) xHigh | 44.3%
6 | GPT-5.3 Codex | GPT 5.3 (Codex) xHigh | 42.4%
7 | Claude Opus 4.6 | Opus-4.6 (Claude Code) | 35.6%
8 | Gemini 3.1 Pro | Gemini-3.1-Pro (Gemini CLI) | 33.8%
9 | Claude Sonnet 4.6 | Sonnet-4.6 (Claude Code) | 32.2%
10 | GLM-5 | Glm-5 (Mini-SWE-Agent) | 24.2%
11 | Kimi K2.5 | Kimi-K2.5 (Mini-SWE-Agent) | 20.9%
12 | MiniMax M2.5 | Minimax-M2.5 (Mini-SWE-Agent) | 19.5%
13 | Gemini 3 Flash | Gemini-3-Flash (Mini-SWE-Agent) | 10.0%
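
Because each row pairs a model with an agent harness, it can help to group the rows by harness before comparing scores. The following is a rough Python sketch of that grouping, assuming the harness is the text in parentheses in each leaderboard label. The names ROWS, harness_of, and group_by_harness are illustrative only. The entries for ranks 1 and 3 are taken from the summary text on this page rather than from the table rows.

```python
import re
from collections import defaultdict

# (leaderboard label, refactoring score in percent) as shown on this page.
ROWS = [
    ("Fable-5 (Claude Code) xHigh", 54.8),  # rank 1, from the summary text
    ("Opus-4.7 (Claude Code)", 48.6),
    ("Opus 4.8 (Claude Code)", 46.7),       # rank 3, from the summary text
    ("GPT-5.5 (Codex) xHigh", 44.8),
    ("GPT 5.4 (Codex) xHigh", 44.3),
    ("GPT 5.3 (Codex) xHigh", 42.4),
    ("Opus-4.6 (Claude Code)", 35.6),
    ("Gemini-3.1-Pro (Gemini CLI)", 33.8),
    ("Sonnet-4.6 (Claude Code)", 32.2),
    ("Glm-5 (Mini-SWE-Agent)", 24.2),
    ("Kimi-K2.5 (Mini-SWE-Agent)", 20.9),
    ("Minimax-M2.5 (Mini-SWE-Agent)", 19.5),
    ("Gemini-3-Flash (Mini-SWE-Agent)", 10.0),
]


def harness_of(label: str) -> str:
    """Extract the harness name from the first parenthesized part of a label."""
    match = re.search(r"\(([^)]+)\)", label)
    return match.group(1) if match else "unknown"


def group_by_harness(rows):
    """Map each harness name to the list of scores reported under it."""
    groups = defaultdict(list)
    for label, score in rows:
        groups[harness_of(label)].append(score)
    return dict(groups)


groups = group_by_harness(ROWS)

# Print harnesses ordered by their best row, highest first.
for harness, scores in sorted(groups.items(), key=lambda kv: -max(kv[1])):
    print(f"{harness}: {len(scores)} rows, best {max(scores):.1f}%")
```

Grouping this way only describes the published rows. It does not separate the contribution of the model from that of the harness, since most models appear under a single harness.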

FAQ

What does SWE-Atlas Refactoring measure?

A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.

Which model leads the published SWE-Atlas Refactoring snapshot?

Fable-5 (Claude Code) xHigh currently leads the published SWE-Atlas Refactoring snapshot with a refactoring score of 54.8%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on SWE-Atlas Refactoring?

13 AI models are included in BenchLM's mirrored SWE-Atlas Refactoring snapshot, based on the public leaderboard as captured on June 19, 2026.

Last updated: June 19, 2026 snapshot · mirrored from the public benchmark leaderboard
