SWE-Rebench

A continuously updated software engineering benchmark by Nebius that uses fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.

Top Models on SWE-Rebench — March 2026

As of March 2026, Claude Opus 4.6 leads the SWE-Rebench leaderboard with 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%).

14 models · Coding category · 35% of category score · Updated March 18, 2026

According to BenchLM.ai, Claude Opus 4.6 leads the SWE-Rebench benchmark with a score of 65.3%, followed by GLM-5 (62.8%) and Gemini 3.1 Pro (62.3%). The top three models are clustered within 3.0 points, suggesting this benchmark is nearing saturation for frontier models.

14 models have been evaluated on SWE-Rebench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-Rebench contributes 35% of the category score, so strong performance here directly affects a model's overall ranking.
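The weighting described above can be made concrete with a small sketch. This assumes a simple multiplicative aggregation (benchmark weight within category times category weight overall); the exact BenchLM.ai formula is not documented here, so the function name and aggregation rule are illustrative assumptions.

```python
# Illustrative sketch: a SWE-Rebench score's contribution to an overall
# BenchLM.ai-style ranking, assuming weights simply multiply.
# The aggregation rule here is an assumption, not the site's documented formula.

CODING_CATEGORY_WEIGHT = 0.20  # Coding category's share of the overall score
SWE_REBENCH_WEIGHT = 0.35      # SWE-Rebench's share within the Coding category

def overall_contribution(resolved_rate: float) -> float:
    """Points a SWE-Rebench resolved rate adds to the overall score."""
    return resolved_rate * SWE_REBENCH_WEIGHT * CODING_CATEGORY_WEIGHT

# Under these assumptions SWE-Rebench drives 0.35 * 0.20 = 7% of the
# overall score, so a 65.3% resolved rate contributes about 4.57 points.
print(overall_contribution(65.3))
```

Under these assumed weights, a 3-point gap on SWE-Rebench moves the overall score by roughly 0.2 points.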

About SWE-Rebench

Year: 2026
Tasks: Fresh GitHub issues (rolling window)
Format: Code patch generation
Difficulty: Professional software engineering

SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified, whose problems date from 2023, scores here reflect consistent, up-to-date difficulty.
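The scoring protocol above can be sketched as follows, assuming "best pass@1" means: run the full problem set 5 times, compute the fraction of problems resolved in each run, and report the best run. That reading of the metric, and all names below, are illustrative assumptions.

```python
# Sketch of the Resolved Rate (best pass@1) metric, under the assumption
# that the best single-run resolution fraction across 5 runs is reported.

def pass_at_1(run_results: list[bool]) -> float:
    """Fraction of problems resolved in one run of the full problem set."""
    return sum(run_results) / len(run_results)

def resolved_rate(runs: list[list[bool]]) -> float:
    """Best pass@1 across all runs (5 runs on SWE-Rebench)."""
    return max(pass_at_1(run) for run in runs)

# Toy example: 4 problems, 3 runs of the same problem set.
runs = [
    [True, False, True, False],   # 50% resolved
    [True, True, True, False],    # 75% resolved
    [True, False, False, False],  # 25% resolved
]
print(resolved_rate(runs))  # 0.75
```

Repeating each problem 5 times smooths out run-to-run variance in agentic scaffolds, where a single unlucky trajectory can fail an otherwise solvable task.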

SWE-Rebench: Contamination-Free Evaluation of Software Engineering Agents

Leaderboard (14 models)

#1 Claude Opus 4.6 — 65.3%
#2 GLM-5 — 62.8%
#3 Gemini 3.1 Pro — 62.3%
#4 DeepSeek V3.2 — 60.9%
#5 Claude Sonnet 4.6 — 60.7%
#6 Claude Sonnet 4.5 — 60.0%
#8 Kimi K2.5 — 58.5%
#9 GPT-5.3 Codex — 58.2%
#10 Kimi K2.5 (Reasoning) — 57.4%
#11 GPT-5.2-Codex — 56.8%
#12 Gemini 3 Flash — 52.5%
#13 GLM-4.5-Air — 38.3%
#14 GPT-OSS 120B — 33.3%

FAQ

What does SWE-Rebench measure?

SWE-Rebench measures a model's ability to resolve real software engineering tasks: fresh GitHub issues collected continuously by Nebius to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffold; the Resolved Rate (best pass@1) is reported.

Which model scores highest on SWE-Rebench?

Claude Opus 4.6 by Anthropic currently leads with a score of 65.3% on SWE-Rebench.

How many models are evaluated on SWE-Rebench?

14 AI models have been evaluated on SWE-Rebench on BenchLM.

Last updated: March 18, 2026
