A Snorkel AI benchmark that evaluates coding agents on senior-level engineering work: building features from realistically under-specified instructions, diagnosing bugs that require runtime investigation, and shipping code that matches existing codebase conventions.
BenchLM mirrors the official Senior SWE-Bench leaderboard published by Snorkel AI for the v2026.06 release. The benchmark contains 100 long-horizon tasks sourced from real pull requests across 12 production repositories, with 50 tasks public and 50 held private to mitigate contamination.
This page ranks models by the primary published metric: tasteful solve rate (pass@1), scored by a taste judge and a validation agent judge that Snorkel calibrated against reviews from its senior software engineering expert network. Snorkel removes runs with detected reward hacking, such as agents searching GitHub for the original pull request, from published scores.
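As a rough illustration of how a pass@1 rate with run filtering could be computed, the sketch below assumes one attempt per task and three hypothetical boolean fields per run (`taste_pass`, `validation_pass`, `reward_hacked`). The field names, the rule that both judges must pass, and the choice to drop flagged runs from the denominator are assumptions made for illustration; Snorkel's actual scoring pipeline is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt at one task (hypothetical record layout)."""
    task_id: str
    taste_pass: bool        # assumed verdict of the taste judge
    validation_pass: bool   # assumed verdict of the validation agent judge
    reward_hacked: bool = False  # assumed flag for detected reward hacking


def tasteful_solve_rate(runs):
    """Percentage of kept runs that pass both judges.

    Assumption: runs flagged for reward hacking are dropped entirely,
    so they count neither as solves nor as failures.
    """
    kept = [r for r in runs if not r.reward_hacked]
    if not kept:
        return 0.0
    solved = sum(1 for r in kept if r.taste_pass and r.validation_pass)
    return 100.0 * solved / len(kept)


# Made-up example data, not real benchmark results.
example_runs = [
    Run("task-1", taste_pass=True, validation_pass=True),
    Run("task-2", taste_pass=True, validation_pass=False),
    Run("task-3", taste_pass=False, validation_pass=True),
    Run("task-4", taste_pass=False, validation_pass=False),
    Run("task-5", taste_pass=True, validation_pass=True, reward_hacked=True),
]

print(f"{tasteful_solve_rate(example_runs):.1f}%")
```

In this made-up data, the fifth run passes both judges but is excluded because it is flagged, so only one of the four remaining runs counts as a tasteful solve.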
Senior SWE-Bench is display only on BenchLM. The published rows are long-horizon agent-harness results with judge-based scoring rather than normalized model-only comparisons, and half the task suite is private, so BenchLM does not use these scores as weighted ranking inputs.
BenchLM mirrors the published tasteful solve rate (pass@1) view for Senior SWE-Bench. Claude Opus 4.8 leads the public snapshot at 24.0%, followed by Claude Sonnet 5 (19.4%) and GPT-5.5 (16.0%). BenchLM does not use these results to rank models overall.
Claude Opus 4.8 (Anthropic): 24.0%
Claude Sonnet 5 (Anthropic): 19.4%
GPT-5.5 (OpenAI): 16.0%
The published Senior SWE-Bench snapshot is fairly close at the top: Claude Opus 4.8 sits at 24.0%, while third-place GPT-5.5 is 8.0 points behind at 16.0%. The spread across the 9 listed models is 21.0 points, so the benchmark still separates strong models even when the leaders cluster.
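The gap figures can be checked with a few lines of arithmetic. This small sketch uses only the three scores quoted on this page; the helper name `gap_to_leader` is made up for illustration.

```python
# Top three published scores quoted on this page (tasteful solve rate, %).
top_three = {
    "Claude Opus 4.8": 24.0,
    "Claude Sonnet 5": 19.4,
    "GPT-5.5": 16.0,
}


def gap_to_leader(scores):
    """Return each model's distance in percentage points from the best score."""
    leader = max(scores.values())
    # Round to one decimal to match the precision of the published figures.
    return {name: round(leader - score, 1) for name, score in scores.items()}


for name, gap in gap_to_leader(top_three).items():
    print(f"{name}: {gap} points behind the leader")
```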
9 models have been evaluated on Senior SWE-Bench. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Senior SWE-Bench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
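To make "displayed but excluded" concrete, here is a minimal sketch of a category contribution that skips display-only benchmarks. Only the 20% Coding weight comes from this page. The simple averaging rule, the `display_only` flag, and the placeholder benchmark names and scores are assumptions; BenchLM's actual formula is documented on its methodology page, not here.

```python
CODING_WEIGHT = 0.20  # category weight stated on this page


def coding_contribution(benchmarks):
    """Weighted contribution of the Coding category to an overall score.

    Assumption: the category score is a plain mean of the benchmarks
    that are not marked display-only.
    """
    scored = [b["score"] for b in benchmarks if not b["display_only"]]
    if not scored:
        return 0.0
    return CODING_WEIGHT * sum(scored) / len(scored)


# Placeholder rows with made-up scores; only the Senior SWE-Bench figure
# (24.0) is taken from this page.
rows = [
    {"name": "hypothetical_benchmark_a", "score": 60.0, "display_only": False},
    {"name": "hypothetical_benchmark_b", "score": 80.0, "display_only": False},
    {"name": "Senior SWE-Bench", "score": 24.0, "display_only": True},
]

print(round(coding_contribution(rows), 2))
```

Under these assumptions, dropping the Senior SWE-Bench row from `rows` leaves the result unchanged, which is what display-only status means in practice.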
Year: 2026
Tasks: 100 tasks (50 public) from 12 repositories
Format: Long-horizon coding agent evaluation with judge scoring
Difficulty: Senior software engineering
Senior SWE-Bench v2026.06 contains 100 long-horizon tasks sourced from real pull requests across 12 production repositories, with 50 tasks public and 50 held private to mitigate contamination. Tasks run in the open-source Harbor harness and are scored with a taste judge and a validation agent judge, both calibrated against reviews from Snorkel's senior software engineering expert network. The primary metric is tasteful solve rate (pass@1), and runs with detected reward hacking are removed from scores.
Version: Senior SWE-Bench v2026.06
Refresh cadence: Quarterly
Staleness state: Current
Question availability: 50 of 100 tasks public as a Harbor dataset
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.8 currently leads the published Senior SWE-Bench snapshot with 24.0% tasteful solve rate (pass@1). BenchLM shows this benchmark for display only and does not use it in overall rankings.
9 AI models are included in BenchLM's mirrored Senior SWE-Bench snapshot, based on the public leaderboard captured for the v2026.06 release.