CyberGym

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

Benchmark score on CyberGym — April 23, 2026

BenchLM mirrors the published score view for CyberGym. GPT-5.5 leads the public snapshot at 81.8%, followed by GPT-5.4 (79.0%) and Claude Opus 4.7 (73.1%). BenchLM does not use these results to rank models overall.

3 models · Agentic · Current · Display only · Updated April 23, 2026

The published CyberGym snapshot is tightly clustered: GPT-5.5 sits at 81.8%, GPT-5.4 trails by 2.8 points, and third-place Claude Opus 4.7 is 8.7 points behind the leader. Because only three models are listed, that 8.7-point gap is also the full spread of published scores.
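The gaps quoted above can be re-derived from the three published scores. The following is a minimal sketch, not BenchLM's own tooling; the `scores` dictionary and the `spread` helper are illustrative names introduced here, and the values are copied from the snapshot on this page.

```python
# Scores as published in the CyberGym snapshot on this page (percent).
# The dictionary and helper names are hypothetical, for illustration only.
scores = {
    "GPT-5.5": 81.8,
    "GPT-5.4": 79.0,
    "Claude Opus 4.7": 73.1,
}


def spread(values):
    """Return the gap between the highest and lowest score, to one decimal."""
    values = list(values)
    return round(max(values) - min(values), 1)


# Rank models from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader, top_score = ranked[0]

# Gap between first and second place, and the full spread of the snapshot.
gap_to_second = round(top_score - ranked[1][1], 1)
total_spread = spread(scores.values())

print(f"Leader: {leader} at {top_score}%")
print(f"Gap to second place: {gap_to_second} points")
print(f"Full spread: {total_spread} points")
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the differences.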

3 models have been evaluated on CyberGym. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. CyberGym is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About CyberGym

Year: 2026

Tasks: 1,507 vulnerability analysis instances

Format: Vulnerability reproduction and PoC generation

Difficulty: Real-world cybersecurity

CyberGym includes 1,507 benchmark instances from historical vulnerabilities across 188 large software projects. BenchLM stores CyberGym as a display-only agentic security benchmark when exact provider comparison values are published.

BenchLM freshness & provenance

Version: CyberGym 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (3 models)

Rank  Model            Score
1     GPT-5.5          81.8%
2     GPT-5.4          79.0%
3     Claude Opus 4.7  73.1%

FAQ

What does CyberGym measure?

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

Which model scores highest on CyberGym?

GPT-5.5 by OpenAI currently leads with a score of 81.8% on CyberGym.

How many models are evaluated on CyberGym?

3 AI models have been evaluated on CyberGym on BenchLM.

Last updated: April 23, 2026 · BenchLM version CyberGym 2026
