A Cognition software-engineering benchmark that evaluates whether coding agents produce mergeable, production-quality pull requests, scoring correctness, tests, scope, style, and maintainability through maintainer-authored rubrics.
BenchLM mirrors Cognition's FrontierCode Diamond score snapshot from June 8, 2026. Cognition reports three nested subsets: Diamond (50 hardest tasks), Main (100 hardest tasks), and Extended (150 total tasks). This page ranks models by the best published Diamond score across available reasoning efforts.
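The "best published Diamond score across available reasoning efforts" rule can be illustrated with a minimal sketch. The `Row` type, the `rank_by_best_diamond` function, and the sample rows are hypothetical placeholders for illustration, not BenchLM's actual code or data.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One published leaderboard row (hypothetical shape)."""
    model: str
    effort: str
    harness: str
    diamond: float  # Diamond score, in percent


def rank_by_best_diamond(rows: list[Row]) -> list[Row]:
    """Keep each model's highest-scoring row, then sort descending by score."""
    best: dict[str, Row] = {}
    for row in rows:
        current = best.get(row.model)
        if current is None or row.diamond > current.diamond:
            best[row.model] = row
    return sorted(best.values(), key=lambda r: r.diamond, reverse=True)


# Placeholder rows, not real FrontierCode results.
rows = [
    Row("model-a", "medium", "harness-x", 4.0),
    Row("model-a", "xhigh", "harness-x", 9.5),
    Row("model-b", "medium", "harness-y", 6.0),
]
ranking = rank_by_best_diamond(rows)
for position, row in enumerate(ranking, start=1):
    print(position, row.model, row.effort, row.harness, row.diamond)
```

Under this sketch, a model evaluated at several reasoning efforts appears once, represented by its strongest run.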
FrontierCode is display-only on BenchLM. The benchmark tasks are not public, and rows combine model choice with agent harnesses such as Claude Code, Codex, Gemini CLI, mini-swe-agent, and Devin, so BenchLM does not use these scores as weighted model-only ranking inputs.
BenchLM mirrors the published Diamond score view for FrontierCode. Claude Opus 4.8 leads the public snapshot at 13.4%, followed by GPT-5.5 (6.3%) and Claude Opus 4.7 (5.2%). BenchLM does not use these results to rank models overall.
Claude Opus 4.8 (Anthropic): Claude Opus 4.8 / xhigh / claude-code
GPT-5.5 (OpenAI): GPT-5.5 / medium / codex
Claude Opus 4.7 (Anthropic): Claude Opus 4.7 / medium / claude-code
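The row labels above appear to follow a "model / reasoning effort / harness" pattern. The following is a minimal sketch of splitting such a label into its parts, assuming that three-part layout holds for every row; `RunConfig` and `parse_row_label` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class RunConfig(NamedTuple):
    """The three parts of a row label (assumed layout)."""
    model: str
    effort: str
    harness: str


def parse_row_label(label: str) -> RunConfig:
    """Split 'model / effort / harness' into named fields."""
    parts = [part.strip() for part in label.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 slash-separated parts, got {len(parts)}: {label!r}")
    return RunConfig(*parts)


config = parse_row_label("Claude Opus 4.8 / xhigh / claude-code")
print(config.model, config.effort, config.harness)
```

Keeping the harness as its own field makes it explicit that each score reflects a model and agent combination rather than the model alone.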
The published FrontierCode snapshot has a clear leader: Claude Opus 4.8 sits at 13.4%, 7.1 points ahead of GPT-5.5 and 8.2 points ahead of the third row. The broader top-10 spread is 12.3 points, so the benchmark still separates strong models.
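The point gaps behind the leader can be recomputed from the three top scores published in this snapshot. This is a small arithmetic sketch; the variable names are illustrative only.

```python
# Top three Diamond scores (percent) as published in this snapshot.
scores = {
    "Claude Opus 4.8": 13.4,
    "GPT-5.5": 6.3,
    "Claude Opus 4.7": 5.2,
}

leader_score = max(scores.values())

# Gap to the leader in percentage points, rounded to one decimal
# to avoid floating-point noise.
gaps = {name: round(leader_score - score, 1) for name, score in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind the leader")
```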
12 models have been evaluated on FrontierCode. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. FrontierCode is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 50 Diamond tasks (150 total across Extended)
Format: Repository task completion with maintainer rubrics
Difficulty: Frontier coding-agent quality
FrontierCode uses 150 software-engineering tasks built with maintainers of 36 open-source repositories. BenchLM displays the hardest 50-task Diamond score as a display-only coding benchmark because the tasks are private and the public rows combine models with agent harnesses such as Claude Code, Codex, Gemini CLI, mini-swe-agent, and Devin.
Version: FrontierCode 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Private tasks with public aggregate results
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.8 currently leads the published FrontierCode snapshot with a 13.4% Diamond score. BenchLM shows this benchmark for display only and does not use it in overall rankings.
12 AI models are included in BenchLM's mirrored FrontierCode snapshot, based on the public leaderboard captured on June 8, 2026.