A long-horizon software engineering benchmark from Abundant AI with multi-hour tasks spanning library reproductions, full-stack product clones, and ML engineering.
BenchLM tracks SWE-Marathon as a source-backed external benchmark. The official v1.0 site describes 20 multi-hour software engineering tasks and 1,300 logged trials across library reproductions, full-stack product clones, and ML engineering work.
SWE-Marathon is display-only on BenchLM. The public site exposes rich task-level leaderboards and trajectory artifacts, but BenchLM will not mirror an aggregate model table until a stable public feed for those rows exists.
Year: 2026
Tasks: 20 multi-hour software engineering tasks
Format: Task resolution and trajectory review
Difficulty: Ultra-long-horizon software engineering
BenchLM tracks SWE-Marathon as a display-only external benchmark. The official v1.0 site reports 20 multi-hour tasks, 1,300 logged trials, task-level leaderboards, and replayable trajectory artifacts; BenchLM keeps it source-metadata-only until there is a stable public aggregate feed.
Version: SWE-Marathon 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
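As a rough illustration of how freshness metadata could map to one of those three roles, here is a minimal sketch. It is not BenchLM's scoring policy: the field names (`staleness_state`, `has_aggregate_feed`, `mirrored_models`) and the order of the rules are assumptions made for this example, and the methodology page remains the authoritative source.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"


@dataclass(frozen=True)
class BenchmarkMeta:
    # All fields are hypothetical names chosen for this sketch.
    name: str
    staleness_state: str      # e.g. "Current"
    has_aggregate_feed: bool  # is there a stable public aggregate feed?
    mirrored_models: int      # number of models in the mirrored snapshot


def classify(meta: BenchmarkMeta) -> Role:
    """Assumed rule order: missing data first, then freshness."""
    # Without an aggregate feed or any mirrored rows there is nothing to rank.
    if not meta.has_aggregate_feed or meta.mirrored_models == 0:
        return Role.DISPLAY_ONLY
    # A fresh benchmark with mirrored rows can separate models.
    if meta.staleness_state == "Current":
        return Role.STRONG_DIFFERENTIATOR
    # Otherwise keep it visible but treat it with caution.
    return Role.WATCH


swe_marathon = BenchmarkMeta(
    name="SWE-Marathon",
    staleness_state="Current",
    has_aggregate_feed=False,
    mirrored_models=0,
)
print(classify(swe_marathon).value)
```

Under these assumed rules, SWE-Marathon lands in the display-only role even though its staleness state is Current, because no aggregate feed or mirrored rows are available.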
No model results for SWE-Marathon are listed on BenchLM yet.
BenchLM's mirrored SWE-Marathon snapshot, based on the public leaderboard captured for SWE-Marathon v1.0, currently includes 0 AI models.