Benchmark profile

ExploitBench v8-bench (ExploitBench)

A cybersecurity benchmark for evaluating LLM agents on full-control V8 exploit synthesis using 16 measured exploit capability flags.

Data verified May 18, 2026

How BenchLM shows ExploitBench

BenchLM mirrors the public ExploitBench v8-bench leaderboard. ExploitBench evaluates LLM cybersecurity agents on V8 exploit synthesis and reports coverage over 16 exploit-capability flags.

ExploitBench is display only on BenchLM because the public rows combine model, environment, and AutoNudge or harness settings. BenchLM keeps the official rows as security-evaluation context rather than weighted model-only evidence.
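Because each public row bundles a model with its harness settings, it can help to split a row identifier into its parts before comparing rows. The sketch below assumes the identifiers follow a `provider/model[:variant]` pattern, which is inferred only from the identifiers shown on this page and is not a documented ExploitBench scheme; `RowId` and `parse_row_id` are hypothetical names.

```python
from typing import NamedTuple, Optional


class RowId(NamedTuple):
    """Parts of a leaderboard row identifier (assumed layout, see lead-in)."""
    provider: str
    model: str
    variant: Optional[str]  # text after ':' (e.g. a harness setting), or None


def parse_row_id(row_id: str) -> RowId:
    # Assumption: everything before the first '/' is the provider,
    # and an optional ':' separates the model from a variant suffix.
    provider, _, rest = row_id.partition("/")
    model, _, variant = rest.partition(":")
    return RowId(provider, model, variant or None)


if __name__ == "__main__":
    for row_id in (
        "anthropic/claude-mythos-preview:autonudge",
        "anthropic/claude-mythos-preview",
        "openai/gpt-5.5:autonudge-codex",
    ):
        print(parse_row_id(row_id))
```

Grouping rows by `(provider, model)` and listing the variants side by side makes it clearer which score differences may come from the harness rather than the model.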

7 mirrored rows · 16 capability flags · V8 exploit synthesis · Official public table · Display only

ExploitBench leaderboard · GitHub repository · Hugging Face dataset

Capability coverage on ExploitBench — May 18, 2026

BenchLM mirrors the published capability coverage view for ExploitBench. Claude Mythos Preview leads the public snapshot at 69% (the anthropic/claude-mythos-preview:autonudge row), followed by the anthropic/claude-mythos-preview row (68%) and GPT 5.5 (Codex) (41%). BenchLM does not use these results to rank models overall.

Claude Mythos Preview · Anthropic · anthropic/claude-mythos-preview:autonudge · 69% · Overall —

Claude Mythos Preview · Anthropic · anthropic/claude-mythos-preview · 68% · Overall —

GPT 5.5 (Codex) · OpenAI · openai/gpt-5.5:autonudge-codex · 41% · Overall —

7 models · External benchmark mirrors · Current · Display only · Updated May 18, 2026

Capability coverage table (7 models)

Model | Provider | Score
Claude Mythos Preview | Anthropic | 69%
Claude Mythos Preview | Anthropic | 68%
GPT 5.5 (Codex) | OpenAI | 41%
GPT 5.5 | OpenAI | 34%
GPT 5.5 (Codex) | OpenAI | 33%
GPT 5.5 | OpenAI | 29%
Claude Opus 4.7 | Anthropic | 27%

The published ExploitBench snapshot places Claude Mythos Preview first at 69%. The third row (GPT 5.5 (Codex), 41%) is 28 points behind. The full range across the 7 rows is 42 points (69% to 27%), so the table still separates the published systems.
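The two gap figures quoted above can be reproduced directly from the seven published scores. This is a minimal sketch for checking the arithmetic; the helper names `gap_to_leader` and `score_range` are hypothetical and not part of any BenchLM or ExploitBench tooling.

```python
from typing import Sequence

# Coverage scores (percent) from the mirrored table, in published order.
SCORES = [69, 68, 41, 34, 33, 29, 27]


def gap_to_leader(scores: Sequence[int], rank: int) -> int:
    """Points between the top score and the score at 1-based `rank`."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[rank - 1]


def score_range(scores: Sequence[int]) -> int:
    """Points between the highest and lowest score."""
    return max(scores) - min(scores)


if __name__ == "__main__":
    print(gap_to_leader(SCORES, 3))  # gap between first and third row
    print(score_range(SCORES))       # spread across all seven rows
```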

BenchLM's mirror contains 7 ExploitBench entries, each a model and harness combination. The benchmark sits in the External benchmark mirrors category, which BenchLM keeps separate from its weighted global scoring system. ExploitBench is therefore shown as source-specific evidence for reference, is excluded from the scoring formula, and does not directly affect overall rankings.

About ExploitBench

Year: 2026

Tasks: V8 exploit synthesis runs

Format: Capability coverage percentage over 16 flags

Difficulty: Browser exploitation and cybersecurity

ExploitBench measures whether LLM agents can turn patched V8 bugs into progressively stronger exploit capabilities, from reaching vulnerable code to full control. BenchLM mirrors the official public leaderboard as display-only security-evaluation context.
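To make "coverage percentage over 16 flags" concrete, here is a minimal sketch that treats coverage as the share of flags reached. This is an assumption for illustration: the page does not say how ExploitBench aggregates flags across bugs or runs, and `coverage_percent` and `mean_coverage` are hypothetical names.

```python
from typing import Sequence

TOTAL_FLAGS = 16  # number of exploit-capability flags stated on this page


def coverage_percent(flags_reached: int, total_flags: int = TOTAL_FLAGS) -> float:
    """Share of capability flags reached, as a percentage.

    Assumption: every flag counts equally. The official weighting,
    if there is one, is not described in this profile.
    """
    if not 0 <= flags_reached <= total_flags:
        raise ValueError("flags_reached must be between 0 and total_flags")
    return 100.0 * flags_reached / total_flags


def mean_coverage(flags_per_run: Sequence[int]) -> float:
    """Average coverage over several runs (assumed aggregation: plain mean)."""
    if not flags_per_run:
        raise ValueError("at least one run is required")
    return sum(coverage_percent(n) for n in flags_per_run) / len(flags_per_run)


if __name__ == "__main__":
    print(coverage_percent(8))     # half of the 16 flags
    print(mean_coverage([8, 16]))  # two hypothetical runs
```

Under this reading, a run that only reaches the vulnerable code scores low, and a run that achieves full control sets all 16 flags.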

ExploitBench · Public benchmark source

BenchLM freshness & provenance

Version: ExploitBench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does ExploitBench measure?

ExploitBench is a cybersecurity benchmark that evaluates LLM agents on full-control V8 exploit synthesis, using 16 measured exploit capability flags.

Which model leads the published ExploitBench snapshot?

Claude Mythos Preview currently leads the published ExploitBench snapshot with 69% capability coverage. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on ExploitBench?

7 AI models are included in BenchLM's mirrored ExploitBench snapshot, based on the public leaderboard captured on May 18, 2026.

Last updated: May 18, 2026 · mirrored from the public benchmark leaderboard
