Benchmark profile

SWE-Marathon

A long-horizon software engineering benchmark from Abundant AI with multi-hour tasks spanning library reproductions, full-stack product clones, and ML engineering.

How BenchLM shows SWE-Marathon

BenchLM tracks SWE-Marathon as a source-backed external benchmark. The official v1.0 site describes 20 multi-hour software engineering tasks and 1,300 logged trials across library reproductions, full-stack product clones, and ML engineering work.

SWE-Marathon is display only on BenchLM. The public site exposes rich task-level leaderboards and trajectory artifacts, but BenchLM is not mirroring an aggregate model table until there is a stable public feed for those rows.

20 multi-hour tasks · 1,300 logged trials · < 19% task resolution · Apache 2.0 · Source metadata only

SWE-Marathon site · GitHub repository

About SWE-Marathon

Year

2026

Tasks

20 multi-hour software engineering tasks

Format

Task resolution and trajectory review

Difficulty

Ultra-long-horizon software engineering

BenchLM tracks SWE-Marathon as a display-only external benchmark. The official v1.0 site reports 20 multi-hour tasks, 1,300 logged trials, task-level leaderboards, and replayable trajectory artifacts; BenchLM keeps it source-metadata-only until there is a stable public aggregate feed.

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work? · Public benchmark source

BenchLM freshness & provenance

Version

SWE-Marathon 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does SWE-Marathon measure?

A long-horizon software engineering benchmark from Abundant AI with multi-hour tasks spanning library reproductions, full-stack product clones, and ML engineering.

Which model leads the published SWE-Marathon snapshot?

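BenchLM's snapshot does not name a leader, because BenchLM does not yet mirror any model rows for SWE-Marathon. The official SWE-Marathon site publishes task-level leaderboards.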

How many models are evaluated on SWE-Marathon?

BenchLM's SWE-Marathon snapshot, based on SWE-Marathon v1.0, includes 0 AI models. The benchmark is display-only on BenchLM, which does not mirror an aggregate model table until there is a stable public feed for those rows.

Last updated: SWE-Marathon v1.0 · mirrored from the public benchmark leaderboard
