A ByteDance Seed benchmark of 134 real-world, day-scale tasks that measures how autonomous agents learn from environment feedback over 12+ hour interaction horizons, spanning scientific and ML, systems and software engineering, optimization, knowledge, formal, and game domains.
BenchLM mirrors the official EdgeBench results published by ByteDance Seed with the July 2, 2026 release. EdgeBench contains 134 real-world, day-scale tasks across 6 domains, built by domain experts averaging 57.2 hours per task, and 51 tasks are publicly released together with the SForge evaluation harness.
This page ranks models by the primary published metric: average score after 12 hours of agent interaction on the full 134-task suite. The mirrored snapshot also preserves each model's score on the 51-task open-source subset.
EdgeBench is display only on BenchLM. The published rows measure long-horizon agent runs inside the SForge harness rather than normalized model-only comparisons, and most of the task suite is not public, so BenchLM does not use these scores as weighted ranking inputs.
BenchLM mirrors the published score @12h view for EdgeBench. Claude Opus 4.8 leads the public snapshot at 51.3%, followed by GPT-5.5 (48.4%) and GPT-5.4 (39.3%). BenchLM does not use these results to rank models overall.
Claude Opus 4.8 (Anthropic)
GPT-5.5 (OpenAI)
GPT-5.4 (OpenAI)
Claude Opus 4.8 leads the published EdgeBench snapshot at 51.3%, with GPT-5.5 2.9 points behind and the third row, GPT-5.4, 12.0 points behind. The top-10 spread is 20.3 points, which here covers all 5 evaluated models, so the benchmark separates strong models even though the top two sit close together.
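The gaps between the top rows can be recomputed from the three scores quoted on this page. The snippet below is a minimal sketch; the names `top_rows` and `gaps_to_leader` are illustrative and not part of BenchLM or SForge.

```python
# Score @12h (percent) for the top three rows, as quoted on this page.
top_rows = [
    ("Claude Opus 4.8", 51.3),
    ("GPT-5.5", 48.4),
    ("GPT-5.4", 39.3),
]


def gaps_to_leader(rows):
    """Return each model's distance in points from the first (leading) row."""
    leader_score = rows[0][1]
    # Round to one decimal to match the precision of the published scores.
    return {name: round(leader_score - score, 1) for name, score in rows}


gaps = gaps_to_leader(top_rows)
for name, gap in gaps.items():
    print(f"{name}: {gap} points behind the leader")
```

Only the three scores stated on this page are used, so the 20.3-point overall spread cannot be reproduced from this snippet alone.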
5 models have been evaluated on EdgeBench. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. EdgeBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 134 tasks (51 public) across 6 domains
Format: Long-horizon interactive agent evaluation
Difficulty: Day-scale expert tasks
EdgeBench tasks are built by domain experts, averaging 57.2 hours of expert effort per task, and run in the open-source SForge harness, which isolates work and judge containers and gives agents iterative feedback instead of one-shot scoring. 51 of the 134 tasks are publicly released with the full evaluation framework, and reported results across roughly 38,000 hours of recorded agent interaction show performance following a log-sigmoid scaling law in interaction time.
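This page does not give the functional form or fitted parameters of the scaling law. As a rough illustration only, one possible reading of "log-sigmoid in interaction time" is a sigmoid applied to the logarithm of elapsed hours. The function name `log_sigmoid_score` and the parameters `ceiling`, `midpoint_hours`, and `slope` are hypothetical, and the values used are placeholders, not EdgeBench's fitted values.

```python
import math


def log_sigmoid_score(hours, ceiling, midpoint_hours, slope):
    """Hypothetical curve: a sigmoid in log(hours), scaled to `ceiling`.

    This is an assumed parameterization for illustration, not the
    published EdgeBench fit.
    """
    if hours <= 0 or midpoint_hours <= 0:
        raise ValueError("hours and midpoint_hours must be positive")
    # Distance from the midpoint measured in log-time.
    z = slope * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


# Placeholder parameters chosen only to show the shape of the curve.
curve = [log_sigmoid_score(h, ceiling=60.0, midpoint_hours=4.0, slope=1.0)
         for h in (1, 2, 4, 8, 12)]
```

Under this assumed form, the score reaches half of `ceiling` at `midpoint_hours`, and each doubling of interaction time adds less as the curve approaches the ceiling.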
Version: EdgeBench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: 51 of 134 tasks public with the SForge harness
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.8 currently leads the published EdgeBench snapshot with a score @12h of 51.3%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
5 AI models are included in BenchLM's mirrored EdgeBench snapshot, which is based on the public leaderboard published with the July 2, 2026 release.