Benchmark profile

FrontierCode 1.1 Main

Cognition's 100-task software-engineering benchmark for whether coding agents produce mergeable, production-quality pull requests, scored for correctness, tests, scope, style, and maintainability through maintainer-authored rubrics.

How we show FrontierCode 1.1

The snapshot mirrors Cognition's FrontierCode 1.1 Main table as captured on July 21, 2026. Main contains 100 of the 150 private tasks. Cognition reports each model at its best-performing published reasoning effort.

We keep FrontierCode display-only. Each result combines a model with an agent harness, and the private tasks cannot be independently rerun from the public artifact, so these scores do not enter the weighted model rankings.

8 model-agent rows · 100 Main tasks · 5 trials per effort · Best-performing effort · Display only
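To illustrate what "best-performing effort" can mean when each effort is run for five trials, here is a minimal sketch. It assumes that trials are averaged per effort and that the effort with the highest mean is reported; the source does not state how Cognition aggregates trials, so treat that rule as an assumption. The trial scores and the helper `best_effort` are hypothetical and are not published results.

```python
from statistics import mean

# Hypothetical per-effort trial scores (percent) for one model-agent row.
# These numbers are invented for illustration; they are NOT published results.
trials_by_effort = {
    "low": [30.0, 31.0, 29.0, 32.0, 28.0],
    "high": [40.0, 41.0, 39.0, 42.0, 38.0],
    "xhigh": [44.0, 45.0, 43.0, 46.0, 42.0],
}


def best_effort(trials):
    """Return (effort, mean score) for the effort with the highest mean.

    Assumption: trials are combined with a plain arithmetic mean.
    """
    means = {effort: mean(scores) for effort, scores in trials.items()}
    effort = max(means, key=means.get)
    return effort, means[effort]


effort, score = best_effort(trials_by_effort)
print(f"{effort}: {score:.1f}%")
```

Reporting only the best effort makes rows comparable at each system's strongest published setting, but it also means a row does not describe how the same model behaves at lower efforts.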

FrontierCode leaderboard · FrontierCode 1.1 methodology · Official result data (JSON)

Main score on FrontierCode 1.1 Main — July 21, 2026 snapshot

BenchLM mirrors the published main score view for FrontierCode 1.1 Main. Claude Fable 5 leads the public snapshot at 53.5%, followed by Claude Opus 4.8 (46.5%) and GPT-5.5 (43.0%). BenchLM does not use these results to rank models overall.

1 · Claude Fable 5 · Anthropic · Closed
Claude Fable 5 / xhigh / claude-code · 53.5%
Overall 83.68 · Context 1M+

2 · Claude Opus 4.8 · Anthropic · Closed
Claude Opus 4.8 / max / claude-code · 46.5%
Overall 78.34 · Context 1M

3 · GPT-5.5 · OpenAI · Closed
GPT-5.5 / xhigh / codex · 43.0%
Overall 73.51 · Context 1M

8 models · Coding · Current · Display only · Updated July 21, 2026

Main score table (8 models)

Model · Provider · Access · Score

Claude Fable 5 · Anthropic · Closed · 53.5%
Claude Opus 4.8 · Anthropic · Closed · 46.5%
GPT-5.5 · OpenAI · Closed · 43.0%
Claude Sonnet 5 · Anthropic · Closed · 42.7%
Claude Opus 4.7 · Anthropic · Closed · 38.5%
GPT-5.4 mini · OpenAI · Closed · 27.0%
Claude Opus 4.6 · Anthropic · Closed · 26.9%
Claude Sonnet 4.6 · Anthropic · Closed · 24.3%

The published FrontierCode 1.1 Main snapshot places Claude Fable 5 first at 53.5%. The third row, GPT-5.5 at 43.0%, is 10.5 points behind. The full range across the 8 rows is 29.2 points (53.5% down to 24.3%), so the table still separates the published systems.
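The point gaps quoted in this summary follow directly from the Main score table. A minimal sketch that recomputes them from the eight listed scores; the variable names are illustrative, and results are rounded to one decimal place to match the table's precision.

```python
# Main scores (percent) copied from the table, in leaderboard order.
scores = [
    ("Claude Fable 5", 53.5),
    ("Claude Opus 4.8", 46.5),
    ("GPT-5.5", 43.0),
    ("Claude Sonnet 5", 42.7),
    ("Claude Opus 4.7", 38.5),
    ("GPT-5.4 mini", 27.0),
    ("Claude Opus 4.6", 26.9),
    ("Claude Sonnet 4.6", 24.3),
]

# Sort defensively so the arithmetic does not depend on input order.
ranked = sorted(scores, key=lambda row: row[1], reverse=True)

# Gap between the leader and the third row, in percentage points.
lead_over_third = round(ranked[0][1] - ranked[2][1], 1)

# Spread between the first and last listed rows, in percentage points.
full_range = round(ranked[0][1] - ranked[-1][1], 1)

print(f"Leader: {ranked[0][0]} at {ranked[0][1]}%")
print(f"Lead over third row: {lead_over_third} points")
print(f"Range across {len(ranked)} rows: {full_range} points")
```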

8 models have been evaluated on FrontierCode 1.1 Main. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. FrontierCode 1.1 Main is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
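A rough sketch of how a display-only benchmark can sit in a category without moving an overall score. BenchLM's actual formula is not given here, so everything below is an assumption: the category score is taken as a plain mean of non-display-only benchmarks, the overall score is a weight-normalized sum of category scores, and the benchmark names `ExampleCodeBench` and `ExampleOtherBench`, their scores, and the lumped `other` category are hypothetical. Only the 20% Coding weight and the 53.5% FrontierCode score come from this page.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    category: str
    score: float              # percent, 0-100
    display_only: bool = False


def category_score(results, category):
    """Mean of scored (non-display-only) benchmarks in a category, or None."""
    scored = [
        r.score for r in results
        if r.category == category and not r.display_only
    ]
    return sum(scored) / len(scored) if scored else None


def overall_score(results, weights):
    """Weighted mean of category scores; categories with no scored data are skipped."""
    total = 0.0
    weight_used = 0.0
    for category, weight in weights.items():
        score = category_score(results, category)
        if score is None:
            continue
        total += weight * score
        weight_used += weight
    return total / weight_used if weight_used else None


# Coding carries a 20% weight per the page; "other" lumps the rest (assumption).
weights = {"coding": 0.20, "other": 0.80}

results = [
    BenchmarkResult("ExampleCodeBench", "coding", 70.0),                    # hypothetical
    BenchmarkResult("FrontierCode 1.1 Main", "coding", 53.5, display_only=True),
    BenchmarkResult("ExampleOtherBench", "other", 80.0),                    # hypothetical
]

print(f"Overall: {overall_score(results, weights):.2f}")
```

Under these assumptions the FrontierCode row is shown alongside the other results but never reaches `category_score`, which matches the page's statement that the benchmark does not directly affect overall rankings.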

About FrontierCode 1.1 Main

Year: 2026

Tasks: 100 private Main tasks (150 in Extended)

Format: Repository task completion with maintainer rubrics

Difficulty: Frontier coding-agent quality

FrontierCode 1.1 Main uses 100 of the benchmark's 150 private software-engineering tasks. The leaderboard reports the best-performing published reasoning effort for each model-agent row. We keep the results display-only because each row combines a model with an agent harness and the private tasks cannot be independently rerun from the public artifact.

FrontierCode leaderboard · Public benchmark source

BenchLM freshness & provenance

Version: FrontierCode 1.1 Main

Refresh cadence: Rolling

Staleness state: Current

Question availability: Private tasks with public aggregate results

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

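What does FrontierCode 1.1 Main measure?

FrontierCode 1.1 Main measures whether coding agents produce mergeable, production-quality pull requests. Its 100 software-engineering tasks are scored for correctness, tests, scope, style, and maintainability through maintainer-authored rubrics.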

Which model leads the published FrontierCode 1.1 Main snapshot?

Claude Fable 5 currently leads the published FrontierCode 1.1 Main snapshot with a main score of 53.5%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on FrontierCode 1.1 Main?

Eight AI models are included in BenchLM's mirrored FrontierCode 1.1 Main snapshot, based on the public leaderboard as captured on July 21, 2026.

Last updated: July 21, 2026 · mirrored from the public benchmark leaderboard
