A cybersecurity benchmark for evaluating LLM agents on full-control V8 exploit synthesis using 16 measured exploit capability flags.
BenchLM mirrors the public ExploitBench v8-bench leaderboard. ExploitBench evaluates LLM cybersecurity agents on V8 exploit synthesis and reports coverage over 16 exploit-capability flags.
ExploitBench is display-only on BenchLM because each public row combines a model with an environment and AutoNudge or harness settings. BenchLM therefore keeps the official rows as security-evaluation context rather than weighted model-only evidence.
BenchLM mirrors the published capability coverage view for ExploitBench. Claude Mythos Preview with AutoNudge leads the public snapshot at 69%, followed by Claude Mythos Preview without AutoNudge (68%) and GPT 5.5 (Codex) (41%). BenchLM does not use these results to rank models overall.
Model | Provider | Row ID
Claude Mythos Preview | Anthropic | anthropic/claude-mythos-preview:autonudge
Claude Mythos Preview | Anthropic | anthropic/claude-mythos-preview
GPT 5.5 (Codex) | OpenAI | openai/gpt-5.5:autonudge-codex
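The row identifiers above appear to follow a `provider/model[:variant]` pattern, where the optional suffix names the AutoNudge or harness setting. A minimal sketch for splitting them, assuming that pattern holds for every row (`RowId` and `parse_row_id` are hypothetical names, not part of any ExploitBench or BenchLM API):

```python
from typing import NamedTuple, Optional


class RowId(NamedTuple):
    """Hypothetical container for one leaderboard row identifier."""
    provider: str
    model: str
    variant: Optional[str]  # e.g. "autonudge"; None when the row has no suffix


def parse_row_id(row_id: str) -> RowId:
    # Split off the optional ":variant" suffix first, then "provider/model".
    base, colon, variant = row_id.partition(":")
    provider, slash, model = base.partition("/")
    if not slash or not provider or not model:
        raise ValueError(f"expected 'provider/model[:variant]', got {row_id!r}")
    return RowId(provider, model, variant if colon else None)


for raw in (
    "anthropic/claude-mythos-preview:autonudge",
    "anthropic/claude-mythos-preview",
    "openai/gpt-5.5:autonudge-codex",
):
    print(parse_row_id(raw))
```

Keeping the variant separate from the model name makes it easier to see that two rows can share a model while differing only in harness settings, which is why BenchLM treats the rows as context rather than model-only evidence.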
The published ExploitBench snapshot is tightly clustered only at the very top: the two Claude Mythos Preview rows sit at 69% and 68%, while the third row, at 41%, is 28 points behind the leader. The broader spread across the leaderboard is 42 points, so the benchmark still separates models clearly below the leading pair.
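The gaps quoted above are plain differences between coverage percentages. A short sketch of that arithmetic, using only the three published scores; pairing each score with a row ID follows the listing order and is an assumption, and `gaps_to_leader` is a hypothetical helper:

```python
from typing import Dict

# Coverage percentages for the top three published rows.
# Assumption: scores map to row IDs in the order the rows are listed.
scores: Dict[str, int] = {
    "anthropic/claude-mythos-preview:autonudge": 69,
    "anthropic/claude-mythos-preview": 68,
    "openai/gpt-5.5:autonudge-codex": 41,
}


def gaps_to_leader(rows: Dict[str, int]) -> Dict[str, int]:
    """Return how many percentage points each row trails the best row."""
    leader = max(rows.values())
    return {row: leader - score for row, score in rows.items()}


gaps = gaps_to_leader(scores)
for row, gap in gaps.items():
    print(f"{row}: {gap} points behind the leader")
```

The second row trails by 1 point and the third by 28, which matches the figures in the text.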
7 models have been evaluated on ExploitBench. The benchmark falls in the External benchmark mirrors category, which BenchLM tracks separately from its weighted global scoring system, so these results are best compared on the category's dedicated benchmark views. ExploitBench is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: V8 exploit synthesis runs
Format: Capability coverage percentage over 16 flags
Difficulty: Browser exploitation and cybersecurity
ExploitBench measures whether LLM agents can turn patched V8 bugs into progressively stronger exploit capabilities, from reaching vulnerable code to full control. BenchLM mirrors the official public leaderboard as display-only security-evaluation context.
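As a rough illustration of a coverage percentage over 16 flags, the sketch below counts how many flags a run demonstrated and divides by 16. This is an assumption about the simplest possible reading of the metric; the official aggregation (for example across bugs or runs) may differ, and the flag names here are placeholders, since the real ones are not listed on this page.

```python
from typing import Iterable, List

TOTAL_FLAGS = 16  # the benchmark reports coverage over 16 capability flags

# Placeholder flag names; the real ExploitBench flag names are not given here.
ALL_FLAGS: List[str] = [f"flag_{i:02d}" for i in range(TOTAL_FLAGS)]


def coverage_percent(achieved: Iterable[str]) -> float:
    """Percentage of the 16 flags present in `achieved` (duplicates ignored)."""
    achieved_set = set(achieved)
    unknown = achieved_set - set(ALL_FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return 100.0 * len(achieved_set) / TOTAL_FLAGS


example = coverage_percent(["flag_00", "flag_01", "flag_02", "flag_03"])
print(f"{example:.0f}%")  # 4 of 16 flags -> prints "25%"
```

Under this reading, a run that only reaches the vulnerable code scores low, and each stronger capability it demonstrates adds one sixteenth of the total.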
Version: ExploitBench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Mythos Preview currently leads the published ExploitBench snapshot with 69% capability coverage. BenchLM shows this benchmark for display only and does not use it in overall rankings.
7 AI models are included in BenchLM's mirrored ExploitBench snapshot, based on the public leaderboard captured on May 18, 2026.