
ExploitBench v8-bench (ExploitBench)

A cybersecurity benchmark for evaluating LLM agents on full-control V8 exploit synthesis using 16 measured exploit capability flags.

How BenchLM shows ExploitBench

BenchLM mirrors the public ExploitBench v8-bench leaderboard. ExploitBench evaluates LLM cybersecurity agents on V8 exploit synthesis and reports coverage over 16 exploit-capability flags.

ExploitBench is display only on BenchLM because the public rows combine model, environment, and AutoNudge or harness settings. BenchLM keeps the official rows as security-evaluation context rather than weighted model-only evidence.

7 mirrored rows · 16 capability flags · V8 exploit synthesis · Official public table · Display only

Capability coverage on ExploitBench — May 18, 2026

BenchLM mirrors the published capability coverage view for ExploitBench. Claude Mythos Preview (the autonudge row) leads the public snapshot at 69%, followed by the same model's row without the autonudge suffix (68%) and GPT 5.5 (Codex) (41%). BenchLM does not use these results to rank models overall.

7 models · External benchmark mirrors · Current · Display only · Updated May 18, 2026

The published ExploitBench snapshot has a narrow lead and a wide field: the two Claude Mythos Preview rows sit at 69% and 68%, while the third row, at 41%, is 28 points behind the leader. The spread across all 7 rows is 42 points (69% to 27%), so the benchmark separates rows clearly below the top two.
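The gaps quoted above can be checked directly against the mirrored scores. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the scores are copied from the capability coverage table on this page.

```python
# Coverage scores (percent) copied from the mirrored table, one per row.
scores = [69, 68, 41, 34, 33, 29, 27]

# Sort from highest to lowest so positions match leaderboard ranks.
ranked = sorted(scores, reverse=True)

gap_first_to_second = ranked[0] - ranked[1]  # leader vs. second row
gap_first_to_third = ranked[0] - ranked[2]   # leader vs. third row
full_spread = ranked[0] - ranked[-1]         # leader vs. last row

print(gap_first_to_second, gap_first_to_third, full_spread)  # 1 28 42
```

These are differences in percentage points within a single snapshot; they say nothing about run-to-run variance.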

7 model configurations have been evaluated on ExploitBench. The benchmark falls in the External benchmark mirrors category. BenchLM tracks this category separately from its weighted global scoring system, so these results are best compared on the dedicated External benchmark mirrors views. ExploitBench is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About ExploitBench

Year: 2026

Tasks: V8 exploit synthesis runs

Format: Capability coverage percentage over 16 flags

Difficulty: Browser exploitation and cybersecurity

ExploitBench measures whether LLM agents can turn patched V8 bugs into progressively stronger exploit capabilities, from reaching vulnerable code to full control. BenchLM mirrors the official public leaderboard as display-only security-evaluation context.
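To make "coverage percentage over 16 flags" concrete, here is a hedged sketch of one way such a score could be computed. Only the flag count of 16 comes from this page. The integer flag indices, the function names, and the choice to average across runs are all assumptions for illustration; the page does not describe how ExploitBench aggregates its results.

```python
TOTAL_FLAGS = 16  # from the page: coverage is reported over 16 capability flags


def coverage_percent(flags_achieved: set[int]) -> float:
    """Percentage of the 16 flags reached in one run.

    Flags are identified by hypothetical indices 0..15; the real
    benchmark's flag names are not given on this page.
    """
    if not all(0 <= f < TOTAL_FLAGS for f in flags_achieved):
        raise ValueError("flag index out of range")
    return 100.0 * len(flags_achieved) / TOTAL_FLAGS


def mean_coverage(runs: list[set[int]]) -> float:
    """Average per-run coverage (assumed aggregation, not confirmed)."""
    if not runs:
        raise ValueError("no runs to aggregate")
    return sum(coverage_percent(r) for r in runs) / len(runs)


# Two hypothetical runs: one reaches 4 flags, the other reaches 8.
example_runs = [{0, 1, 2, 3}, set(range(8))]
print(coverage_percent(example_runs[0]))  # 25.0
print(mean_coverage(example_runs))        # 37.5
```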

BenchLM freshness & provenance

Version: ExploitBench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Capability coverage table (7 models)

Rank  Model                  Row identifier                              Coverage
1     Claude Mythos Preview  anthropic/claude-mythos-preview:autonudge   69%
2     Claude Mythos Preview  anthropic/claude-mythos-preview             68%
3     GPT 5.5 (Codex)        openai/gpt-5.5:autonudge-codex              41%
4     GPT 5.5                openai/gpt-5.5:autonudge                    34%
5     GPT 5.5 (Codex)        openai/gpt-5.5:codex                        33%
6     GPT 5.5                openai/gpt-5.5                              29%
7     Claude Opus 4.7        anthropic/claude-opus-4-7:autonudge         27%
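Because the rows combine a model with AutoNudge or harness settings, one way to read the table is to pair each autonudge row with the matching row that lacks that setting. The sketch below does this under an assumption that the page does not confirm: that the text after the colon in a row identifier is a hyphen-separated list of settings, and that removing "autonudge" from it yields the comparable baseline row. The helper name baseline_id is hypothetical.

```python
# Row identifiers and coverage scores (percent) copied from the table.
rows = {
    "anthropic/claude-mythos-preview:autonudge": 69,
    "anthropic/claude-mythos-preview": 68,
    "openai/gpt-5.5:autonudge-codex": 41,
    "openai/gpt-5.5:autonudge": 34,
    "openai/gpt-5.5:codex": 33,
    "openai/gpt-5.5": 29,
    "anthropic/claude-opus-4-7:autonudge": 27,
}


def baseline_id(row_id):
    """Return the identifier with 'autonudge' removed, or None if absent.

    Assumes the suffix after ':' is a hyphen-separated settings list.
    """
    model, _, settings = row_id.partition(":")
    parts = settings.split("-") if settings else []
    if "autonudge" not in parts:
        return None
    rest = [p for p in parts if p != "autonudge"]
    return model + (":" + "-".join(rest) if rest else "")


# Percentage-point difference for each autonudge row with a baseline.
deltas = {}
for row_id, score in rows.items():
    base = baseline_id(row_id)
    if base in rows:
        deltas[row_id] = score - rows[base]

for row_id, delta in deltas.items():
    print(f"{row_id}: +{delta} points")
```

Under that pairing assumption, the differences are 1 point for Claude Mythos Preview, 8 points for GPT 5.5 (Codex), and 5 points for GPT 5.5. Claude Opus 4.7 has no row without autonudge in this snapshot, so it is skipped.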

FAQ

What does ExploitBench measure?

ExploitBench is a cybersecurity benchmark that evaluates LLM agents on full-control V8 exploit synthesis, scored over 16 measured exploit capability flags.

Which model leads the published ExploitBench snapshot?

Claude Mythos Preview currently leads the published ExploitBench snapshot with 69% capability coverage. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on ExploitBench?

BenchLM's mirrored ExploitBench snapshot includes 7 rows, each combining a model with AutoNudge or harness settings, based on the public leaderboard captured on May 18, 2026.

Last updated: May 18, 2026 · mirrored from the public benchmark leaderboard
