A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
BenchLM mirrors the published score view for CyberGym. GPT-5.5 leads the public snapshot at 81.8%, followed by GPT-5.4 (79.0%) and Claude Opus 4.7 (73.1%). BenchLM does not use these results to rank models overall.
GPT-5.5
OpenAI
GPT-5.4
OpenAI
Claude Opus 4.7
Anthropic
The published CyberGym snapshot is tightly clustered at the top: GPT-5.5 sits at 81.8%, while the third-ranked model, Claude Opus 4.7, is only 8.7 points behind at 73.1%. With just three published scores, the entire snapshot sits within that 8.7-point band.
Three models have been evaluated on CyberGym. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. CyberGym is currently displayed for reference only and excluded from the scoring formula, so it does not directly affect overall rankings.
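As a rough illustration of how a display-only benchmark interacts with category weighting, the sketch below excludes reference-only benchmarks before applying the 22% Agentic weight. All function names, the second benchmark, and the averaging rule are assumptions for illustration, not BenchLM's published formula:

```python
# Hypothetical sketch: a display-only benchmark (like CyberGym) is
# skipped before the Agentic category weight is applied. The data and
# aggregation rule here are illustrative assumptions only.

AGENTIC_WEIGHT = 0.22  # Agentic category weight stated on this page

benchmarks = [
    # (name, top_score_pct, display_only)
    ("CyberGym", 81.8, True),            # reference only: excluded
    ("HypotheticalAgenticBench", 70.0, False),
]

def agentic_contribution(benchmarks):
    """Average the scorable agentic benchmarks, then apply the weight."""
    scorable = [score for _, score, display_only in benchmarks
                if not display_only]
    if not scorable:
        return 0.0
    return AGENTIC_WEIGHT * (sum(scorable) / len(scorable))

# CyberGym's 81.8% does not move this number at all.
print(agentic_contribution(benchmarks))
```

The point of the sketch is only that changing CyberGym's score leaves the weighted total unchanged, which is what "display-only" means here.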
Year
2026
Tasks
1,507 vulnerability analysis instances
Format
Vulnerability reproduction and PoC generation
Difficulty
Real-world cybersecurity
CyberGym includes 1,507 benchmark instances drawn from historical vulnerabilities across 188 large software projects. BenchLM treats CyberGym as a display-only agentic security benchmark and mirrors scores only when providers publish exact, comparable values.
Version
CyberGym 2026
Refresh cadence
Quarterly
Staleness state
Current
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
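The tiering described above can be sketched as a simple mapping. The tier names come from this page; the decision rule, function name, and threshold are assumptions, not BenchLM's published policy:

```python
# Hypothetical mapping from freshness metadata to how BenchLM treats a
# benchmark. Tier labels come from the page; the rule itself is assumed.

def benchmark_tier(staleness_state: str, display_only: bool) -> str:
    """Classify a benchmark into one of the three tiers named on the page."""
    if display_only:
        # CyberGym's case: shown for reference, excluded from scoring
        return "display-only reference"
    if staleness_state == "Current":
        return "strong differentiator"
    return "benchmark to watch"

print(benchmark_tier("Current", display_only=True))  # CyberGym today
```

Under this sketch, CyberGym lands in the display-only tier regardless of its "Current" staleness state, because the display-only flag takes precedence.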
GPT-5.5 by OpenAI currently leads with a score of 81.8% on CyberGym.
Three AI models have been evaluated on CyberGym on BenchLM.