EdgeBench

A ByteDance Seed benchmark of 134 real-world, day-scale tasks that measures how autonomous agents learn from environment feedback over 12+ hour interaction horizons, spanning scientific and ML, systems and software engineering, optimization, knowledge, formal, and game domains.

How BenchLM shows EdgeBench

BenchLM mirrors the official EdgeBench results published by ByteDance Seed with the July 2, 2026 release. EdgeBench contains 134 real-world, day-scale tasks across 6 domains, built by domain experts averaging 57.2 hours per task, and 51 tasks are publicly released together with the SForge evaluation harness.

This page ranks models by the primary published metric: average score after 12 hours of agent interaction on the full 134-task suite. The mirrored snapshot also preserves each model's score on the 51-task open-source subset.

EdgeBench is display only on BenchLM. The published rows measure long-horizon agent runs inside the SForge harness rather than normalized model-only comparisons, and most of the task suite is not public, so BenchLM does not use these scores as weighted ranking inputs.

5 model rows · 134 tasks (51 public) · 6 task domains · Score @12h · Display only

Score @12h on EdgeBench — July 2, 2026 release

BenchLM mirrors the published score @12h view for EdgeBench. Claude Opus 4.8 leads the public snapshot at 51.3%, followed by GPT-5.5 (48.4%) and GPT-5.4 (39.3%). BenchLM does not use these results to rank models overall.

5 models · Agentic · Current · Display only · Updated July 2, 2026 release

The published EdgeBench snapshot is close at the very top: Claude Opus 4.8 sits at 51.3%, GPT-5.5 trails by 2.9 points, and the third row is 12.0 points behind the leader. The spread across all five rows is 20.3 points, so the benchmark still separates strong models even when the leaders cluster.
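The gaps quoted above follow from the five score @12h values in the mirrored table. A minimal sketch of that arithmetic, where the helper name `gap` is an illustrative assumption and not part of BenchLM or EdgeBench:

```python
# Score @12h values (percent) as listed in the mirrored table on this page.
scores = [51.3, 48.4, 39.3, 37.4, 31.0]


def gap(values, i, j):
    """Return the gap in points between rank i and rank j (1-indexed)."""
    ranked = sorted(values, reverse=True)
    # Round to one decimal place to match the precision of the published scores.
    return round(ranked[i - 1] - ranked[j - 1], 1)


top3_gap = gap(scores, 1, 3)               # leader vs. third row
full_spread = gap(scores, 1, len(scores))  # leader vs. last row
print(top3_gap, full_spread)
```

Rounding to one decimal place keeps the result at the same precision as the published percentages and avoids floating-point noise in the subtraction.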

5 models have been evaluated on EdgeBench. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. EdgeBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
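To illustrate what "display only" means for an overall score, here is a rough sketch of a category-weighted average that skips display-only benchmarks. BenchLM's actual formula is not given on this page, so the `Benchmark` class, the `overall_score` function, the second benchmark, and the 0.78 weight are all hypothetical; only the 22% Agentic weight and the 51.3% EdgeBench score come from the text.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float        # percent, 0-100
    display_only: bool  # True means shown on the page but never scored


def overall_score(benchmarks, category_weights):
    """Weighted mean of per-category averages, ignoring display-only rows."""
    by_category = {}
    for b in benchmarks:
        if b.display_only:
            continue  # excluded from the scoring formula
        by_category.setdefault(b.category, []).append(b.score)

    weighted, weight_sum = 0.0, 0.0
    for category, values in by_category.items():
        w = category_weights.get(category, 0.0)
        weighted += w * (sum(values) / len(values))
        weight_sum += w
    return weighted / weight_sum if weight_sum else 0.0


# Hypothetical inputs: only the 0.22 Agentic weight and the EdgeBench score are from the page.
weights = {"Agentic": 0.22, "Other": 0.78}
rows = [
    Benchmark("EdgeBench", "Agentic", 51.3, display_only=True),
    Benchmark("HypotheticalAgentic", "Agentic", 60.0, display_only=False),
    Benchmark("HypotheticalOther", "Other", 80.0, display_only=False),
]
print(overall_score(rows, weights))
```

Under these assumptions, changing the EdgeBench score leaves the result unchanged, which is the behavior the paragraph above describes.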

About EdgeBench

Year

2026

Tasks

134 tasks (51 public) across 6 domains

Format

Long-horizon interactive agent evaluation

Difficulty

Day-scale expert tasks

EdgeBench tasks are built by domain experts, averaging 57.2 hours of expert effort per task, and run in the open-source SForge harness, which isolates work and judge containers and gives agents iterative feedback instead of one-shot scoring. 51 of the 134 tasks are publicly released with the full evaluation framework, and reported results across roughly 38,000 hours of recorded agent interaction show performance following a log-sigmoid scaling law in interaction time.
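The page does not give the functional form or fitted parameters of the log-sigmoid scaling law. One plausible reading is a sigmoid applied to the logarithm of interaction time, sketched below. The function name, the parameter names (`ceiling`, `midpoint_hours`, `slope`), and the example values are assumptions for illustration only, not fitted EdgeBench results.

```python
import math


def log_sigmoid_score(t_hours, ceiling, midpoint_hours, slope):
    """Score as a sigmoid in log(interaction time).

    ceiling        -- asymptotic score as time grows (assumed parameter)
    midpoint_hours -- time at which the score reaches half the ceiling (assumed)
    slope          -- steepness of the curve in log-time (assumed)
    """
    z = slope * (math.log(t_hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


# Illustrative values only: they are not taken from the EdgeBench report.
for t in (1, 3, 6, 12, 24):
    print(t, round(log_sigmoid_score(t, ceiling=60.0, midpoint_hours=4.0, slope=1.5), 1))
```

Under this form, each doubling of interaction time adds a fixed step in log-time, so gains are largest near the midpoint and flatten as the score approaches the ceiling.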

BenchLM freshness & provenance

Version

EdgeBench 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

51 of 134 tasks public with the SForge harness

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Score @12h table (5 models)

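Rank | Model | Score @12h
1 | Claude Opus 4.8 | 51.3%
2 | GPT-5.5 | 48.4%
3 | GPT-5.4 | 39.3%
4 | (model name not captured) | 37.4%
5 | (model name not captured) | 31.0%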

FAQ

What does EdgeBench measure?

EdgeBench is a ByteDance Seed benchmark of 134 real-world, day-scale tasks. It measures how autonomous agents learn from environment feedback over 12+ hour interaction horizons, spanning scientific and ML, systems and software engineering, optimization, knowledge, formal, and game domains.

Which model leads the published EdgeBench snapshot?

Claude Opus 4.8 currently leads the published EdgeBench snapshot with 51.3% score @12h. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on EdgeBench?

5 AI models are included in BenchLM's mirrored EdgeBench snapshot, based on the July 2, 2026 release of the public leaderboard.

Last updated: July 2, 2026 release · mirrored from the public benchmark leaderboard
