A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
BenchLM mirrors the published score view for CyberGym. GPT-5.5 leads the public snapshot at 81.8%, followed by GPT-5.4 (79.0%) and Claude Opus 4.7 (73.1%). BenchLM does not use these results to rank models overall.
GPT-5.5
OpenAI
GPT-5.4
OpenAI
Claude Opus 4.7
Anthropic
The published CyberGym snapshot is tightly clustered at the top: GPT-5.5 sits at 81.8%, while the third-ranked model, Claude Opus 4.7, is only 8.7 points behind at 73.1%. With just three published scores, the entire snapshot sits within that 8.7-point band.
Three models have been evaluated on CyberGym. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. CyberGym is currently displayed for reference only and excluded from the scoring formula, so it does not directly affect overall rankings.
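As a rough illustration of how a display-only benchmark interacts with category weighting, the sketch below excludes reference-only benchmarks before applying the 22% Agentic weight. All function names, the second benchmark, and the averaging rule are assumptions for illustration, not BenchLM's published formula:

```python
# Hypothetical sketch: a display-only benchmark (like CyberGym) is
# skipped before the Agentic category weight is applied. The data and
# aggregation rule here are illustrative assumptions only.

AGENTIC_WEIGHT = 0.22  # Agentic category weight stated on this page

benchmarks = [
    # (name, top_score_pct, display_only)
    ("CyberGym", 81.8, True),            # reference only: excluded
    ("HypotheticalAgenticBench", 70.0, False),
]

def agentic_contribution(benchmarks):
    """Average the scorable agentic benchmarks, then apply the weight."""
    scorable = [score for _, score, display_only in benchmarks
                if not display_only]
    if not scorable:
        return 0.0
    return AGENTIC_WEIGHT * (sum(scorable) / len(scorable))

# CyberGym's 81.8% does not move this number at all.
print(agentic_contribution(benchmarks))
```

The point of the sketch is only that changing CyberGym's score leaves the weighted total unchanged, which is what "display-only" means here.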
Year
2026
Tasks
1,507 vulnerability analysis instances
Format
Vulnerability reproduction and PoC generation
Difficulty
Real-world cybersecurity
CyberGym includes 1,507 benchmark instances drawn from historical vulnerabilities across 188 large software projects. BenchLM treats CyberGym as a display-only agentic security benchmark and mirrors scores only when providers publish exact, comparable values.
Version
CyberGym 2026
Refresh cadence
Quarterly
Staleness state
Current
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
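The tiering described above can be sketched as a simple mapping. The tier names come from this page; the decision rule, function name, and threshold are assumptions, not BenchLM's published policy:

```python
# Hypothetical mapping from freshness metadata to how BenchLM treats a
# benchmark. Tier labels come from the page; the rule itself is assumed.

def benchmark_tier(staleness_state: str, display_only: bool) -> str:
    """Classify a benchmark into one of the three tiers named on the page."""
    if display_only:
        # CyberGym's case: shown for reference, excluded from scoring
        return "display-only reference"
    if staleness_state == "Current":
        return "strong differentiator"
    return "benchmark to watch"

print(benchmark_tier("Current", display_only=True))  # CyberGym today
```

Under this sketch, CyberGym lands in the display-only tier regardless of its "Current" staleness state, because the display-only flag takes precedence.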
GPT-5.5 by OpenAI currently leads with a score of 81.8% on CyberGym.
Three AI models have been evaluated on CyberGym on BenchLM.