Senior SWE-Bench

Name: Senior SWE-Bench
Creator: BenchLM

A Snorkel AI benchmark that evaluates coding agents on senior-level engineering work: building features from realistically under-specified instructions, diagnosing bugs that require runtime investigation, and shipping code that matches existing codebase conventions.

How BenchLM shows Senior SWE-Bench

BenchLM mirrors the official Senior SWE-Bench leaderboard published by Snorkel AI for the v2026.06 release. The benchmark contains 100 long-horizon tasks sourced from real pull requests across 12 production repositories, with 50 tasks public and 50 held private to mitigate contamination.

This page ranks models by the primary published metric: tasteful solve rate (pass@1), scored by a taste judge and a validation agent judge that Snorkel calibrated against reviews from its senior software engineering expert network. Snorkel removes runs with detected reward hacking, such as agents searching GitHub for the original pull request, from published scores.
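As a rough illustration of how a metric like this could be computed, here is a minimal sketch. It rests on assumptions that the published description does not confirm: that a run counts as a tasteful solve only when both judges accept it, and that runs flagged for reward hacking are dropped entirely rather than counted as failures. The `Run` type and its field names are hypothetical and are not taken from Snorkel's schema or the Harbor harness.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # Hypothetical record of one single-attempt (pass@1) run on one task.
    task_id: str
    validation_pass: bool  # verdict of the validation agent judge
    taste_pass: bool       # verdict of the taste judge
    reward_hacked: bool    # e.g. the agent searched GitHub for the original PR


def tasteful_solve_rate(runs: list[Run]) -> float:
    """Percent of retained runs that both judges accept."""
    # Assumption: flagged runs are removed before scoring, not scored as failures.
    kept = [r for r in runs if not r.reward_hacked]
    if not kept:
        return 0.0
    # Assumption: a tasteful solve requires both judges to pass the run.
    solved = sum(1 for r in kept if r.validation_pass and r.taste_pass)
    return 100.0 * solved / len(kept)


# Made-up example data, not benchmark results.
runs = [
    Run("task-1", True, True, False),
    Run("task-2", True, False, False),   # functionally valid, fails the taste judge
    Run("task-3", False, False, False),
    Run("task-4", True, True, True),     # flagged for reward hacking, dropped
    Run("task-5", True, True, False),
]
print(tasteful_solve_rate(runs))
```

In this made-up example, four runs are retained and two of them pass both judges. If flagged runs were instead counted as failures, the denominator would stay at five and the rate would be lower.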

Senior SWE-Bench is display only on BenchLM. The published rows are long-horizon agent-harness results with judge-based scoring rather than normalized model-only comparisons, and half the task suite is private, so BenchLM does not use these scores as weighted ranking inputs.

9 model rows · 100 tasks (50 public) · 12 source repositories · Tasteful solve rate (pass@1) · Display only

Senior SWE-Bench site · How it works · GitHub repository (Harbor dataset) · Snorkel leaderboards

Tasteful solve rate (pass@1) on Senior SWE-Bench — v2026.06 release

BenchLM mirrors the published tasteful solve rate (pass@1) view for Senior SWE-Bench. Claude Opus 4.8 leads the public snapshot at 24.0%, followed by Claude Sonnet 5 (19.4%) and GPT-5.5 (16.0%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.8 · Anthropic · Closed · 24.0% · Overall 92 · Context 1M

2. Claude Sonnet 5 · Anthropic · Closed · 19.4% · Overall n/a · Context 1M

3. GPT-5.5 · OpenAI · Closed · 16.0% · Overall 87 · Context 1M

9 models · Coding · Current · Display only · Updated v2026.06 release

The top of the published Senior SWE-Bench snapshot is fairly close: Claude Opus 4.8 sits at 24.0%, and the third row (GPT-5.5 at 16.0%) is 8.0 points behind. The spread across all 9 published rows is 21.0 points (24.0% down to 3.0%), so the benchmark still separates strong models even when the leaders cluster.
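The gap figures can be checked directly from the published rows. The short sketch below recomputes the leader-to-third-row gap and the spread across all published rows, using the scores from the tasteful solve rate table on this page. The variable names are illustrative only.

```python
# Published tasteful solve rates (pass@1, percent) for the v2026.06 release,
# copied from the table on this page.
scores = {
    "Claude Opus 4.8": 24.0,
    "Claude Sonnet 5": 19.4,
    "GPT-5.5": 16.0,
    "Claude Opus 4.7": 14.1,
    "GLM-5.2": 12.5,
    "Kimi K2.6": 8.2,
    "Claude Sonnet 4.6": 8.2,
    "Gemini 3.1 Pro": 6.1,
    "Gemini 3.5 Flash": 3.0,
}

ranked = sorted(scores.values(), reverse=True)

# Gap between the leader and the third row, rounded to one decimal place.
top3_gap = round(ranked[0] - ranked[2], 1)

# Spread between the highest and lowest published rows.
full_spread = round(ranked[0] - ranked[-1], 1)

print(f"Rows: {len(ranked)}")
print(f"Leader to third row: {top3_gap} points")
print(f"Leader to last row: {full_spread} points")
```

Because the snapshot has only 9 rows, the 21.0-point figure is the spread across the whole published table.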

9 models have been evaluated on Senior SWE-Bench. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. Senior SWE-Bench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Senior SWE-Bench

Year: 2026
Tasks: 100 tasks (50 public) from 12 repositories
Format: Long-horizon coding agent evaluation with judge scoring
Difficulty: Senior software engineering

Senior SWE-Bench v2026.06 contains 100 long-horizon tasks sourced from real pull requests across 12 production repositories, with 50 tasks public and 50 held private to mitigate contamination. Tasks run in the open-source Harbor harness and are scored with a taste judge and a validation agent judge, both calibrated against reviews from Snorkel's senior software engineering expert network. The primary metric is tasteful solve rate (pass@1), and runs with detected reward hacking are removed from scores.


BenchLM freshness & provenance

Version: Senior SWE-Bench v2026.06
Refresh cadence: Quarterly
Staleness state: Current
Question availability: 50 of 100 tasks public as a Harbor dataset


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Tasteful solve rate (pass@1) table (9 models)

Model               Creator       Open/Closed   Tasteful solve rate (pass@1)
Claude Opus 4.8     Anthropic     Closed        24.0%
Claude Sonnet 5     Anthropic     Closed        19.4%
GPT-5.5             OpenAI        Closed        16.0%
Claude Opus 4.7     Anthropic     Closed        14.1%
GLM-5.2             Z.AI          Open          12.5%
Kimi K2.6           Moonshot AI   Open          8.2%
Claude Sonnet 4.6   Anthropic     Closed        8.2%
Gemini 3.1 Pro      Google        Closed        6.1%
Gemini 3.5 Flash    Google        Closed        3.0%

FAQ

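What does Senior SWE-Bench measure?

Senior SWE-Bench measures how well coding agents handle senior-level engineering work: building features from realistically under-specified instructions, diagnosing bugs that require runtime investigation, and shipping code that matches existing codebase conventions.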

Which model leads the published Senior SWE-Bench snapshot?

Claude Opus 4.8 currently leads the published Senior SWE-Bench snapshot with 24.0% tasteful solve rate (pass@1). BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Senior SWE-Bench?

9 AI models are included in BenchLM's mirrored Senior SWE-Bench snapshot, based on the public leaderboard captured for the v2026.06 release.

Last updated: v2026.06 release · mirrored from the public benchmark leaderboard
