FrontierCode Diamond (FrontierCode)

A Cognition software-engineering benchmark that evaluates whether coding agents produce mergeable, production-quality pull requests, scoring correctness, tests, scope, style, and maintainability through maintainer-authored rubrics.

How BenchLM shows FrontierCode

BenchLM mirrors Cognition's FrontierCode Diamond score snapshot from June 8, 2026. Cognition reports three nested subsets: Diamond (50 hardest tasks), Main (100 hardest tasks), and Extended (150 total tasks). This page ranks models by the best published Diamond score across available reasoning efforts.
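The "best published score across available reasoning efforts" rule can be sketched as below. This is a minimal illustration, not BenchLM's or Cognition's actual code: the function name `rank_by_best_effort`, the row names, and every score are hypothetical placeholders, and the sketch assumes each effort level already has a single published score.

```python
from typing import Dict, List, Tuple


def rank_by_best_effort(
    scores: Dict[str, Dict[str, float]]
) -> List[Tuple[str, str, float]]:
    """Return (row, best_effort, best_score) tuples, highest score first."""
    best_rows = []
    for row, by_effort in scores.items():
        # Keep only the effort level with the highest score for this row.
        effort, score = max(by_effort.items(), key=lambda item: item[1])
        best_rows.append((row, effort, score))
    return sorted(best_rows, key=lambda r: r[2], reverse=True)


# Hypothetical placeholder rows and scores, NOT values from the leaderboard.
example_scores = {
    "agent-a": {"low": 2.0, "medium": 3.5, "xhigh": 3.0},
    "agent-b": {"low": 4.0, "xhigh": 1.5},
    "agent-c": {"none": 0.5},
}

ranking = rank_by_best_effort(example_scores)
for position, (row, effort, score) in enumerate(ranking, start=1):
    print(f"{position}. {row} ({effort}): {score:.1f}%")
```

Under this rule a row's listed effort is simply whichever setting scored highest, which is why the table can show `low` or `medium` for some rows and `xhigh` for others.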

FrontierCode is display only on BenchLM. The benchmark tasks are not public, and rows combine model choice with agent harnesses such as Claude Code, Codex, Gemini CLI, mini-swe-agent, and Devin, so BenchLM does not use these scores as weighted model-only ranking inputs.

12 model-agent rows · 50 Diamond tasks · 5 trials per effort · Best-effort score · Display only

Diamond score on FrontierCode — June 8, 2026

BenchLM mirrors the published Diamond score view for FrontierCode. Claude Opus 4.8 leads the public snapshot at 13.4%, followed by GPT-5.5 (6.3%) and Claude Opus 4.7 (5.2%). BenchLM does not use these results to rank models overall.

12 models · Coding · Current · Display only · Updated June 8, 2026

The published FrontierCode snapshot has a clear leader rather than a tight cluster at the top: Claude Opus 4.8 sits at 13.4%, 7.1 points ahead of GPT-5.5 and 8.2 points ahead of the third row. The top-10 spread is 12.3 points, and most of it comes from that lead; rows 2 through 10 span only 5.2 points (6.3% to 1.1%).
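The gap and spread figures can be checked against the Diamond score table on this page. The sketch below copies the twelve published scores in rank order; the helper name `gap` is an illustrative choice, not part of any BenchLM tooling.

```python
# Diamond scores (percent) copied from the table on this page, in rank order.
diamond_scores = [13.4, 6.3, 5.2, 4.7, 4.6, 3.8, 3.5, 2.5, 2.4, 1.1, 1.0, 0.7]


def gap(scores, higher_rank, lower_rank):
    """Point difference between two 1-based ranks, rounded to one decimal."""
    return round(scores[higher_rank - 1] - scores[lower_rank - 1], 1)


lead_over_second = gap(diamond_scores, 1, 2)    # leader vs. rank 2
lead_over_third = gap(diamond_scores, 1, 3)     # leader vs. rank 3
top10_spread = gap(diamond_scores, 1, 10)       # rank 1 vs. rank 10
rest_of_top10 = gap(diamond_scores, 2, 10)      # rank 2 vs. rank 10

print(lead_over_second, lead_over_third, top10_spread, rest_of_top10)
```

The leader is 8.2 points ahead of the third row and 12.3 points ahead of the tenth, while ranks 2 through 10 sit within 5.2 points of each other.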

12 models have been evaluated on FrontierCode. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. FrontierCode is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About FrontierCode

Year: 2026

Tasks: 50 Diamond tasks (150 total across Extended)

Format: Repository task completion with maintainer rubrics

Difficulty: Frontier coding-agent quality

FrontierCode uses 150 software-engineering tasks built with maintainers of 36 open-source repositories. BenchLM displays the hardest 50-task Diamond score as a display-only coding benchmark because the tasks are private and the public rows combine models with agent harnesses such as Claude Code, Codex, Gemini CLI, mini-swe-agent, and Devin.

BenchLM freshness & provenance

Version: FrontierCode 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Private tasks with public aggregate results

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Diamond score table (12 models)

Rank  Model                  Effort  Harness         Diamond score
1     Claude Opus 4.8        xhigh   claude-code     13.4%
2     GPT-5.5                medium  codex           6.3%
3     Claude Opus 4.7        medium  claude-code     5.2%
4     Gemini 3.1 Pro         low     gemini-cli      4.7%
5     GPT-5.4-mini           xhigh   codex           4.6%
6     Kimi K2.6              none    mini-swe-agent  3.8%
7     Claude Sonnet 4.6      xhigh   claude-code     3.5%
8     SWE-1.6                none    devin           2.5%
9     MiniMax M2.7           none    mini-swe-agent  2.4%
10    MiniMax M2.5           none    mini-swe-agent  1.1%
11    Kimi K2.5              none    mini-swe-agent  1.0%
12    Gemini 3.1 Flash Lite  low     gemini-cli      0.7%

FAQ

What does FrontierCode measure?

A Cognition software-engineering benchmark that evaluates whether coding agents produce mergeable, production-quality pull requests, scoring correctness, tests, scope, style, and maintainability through maintainer-authored rubrics.

Which model leads the published FrontierCode snapshot?

Claude Opus 4.8 currently leads the published FrontierCode snapshot with a Diamond score of 13.4%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on FrontierCode?

12 AI models are included in BenchLM's mirrored FrontierCode snapshot, based on the public leaderboard captured on June 8, 2026.

Last updated: June 8, 2026 · mirrored from the public benchmark leaderboard
