
Artificial Analysis Coding Agent Index (AA Coding Agents)

A display-only Artificial Analysis leaderboard for coding-agent systems, combining agent harnesses, host models, and execution settings across software-engineering benchmarks.

How BenchLM shows AA Coding Agents

BenchLM mirrors the Artificial Analysis Coding Agent Index v1.1 page as a display-only agent leaderboard. The source compares coding-agent variants across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA, and reports the average pass@1 index alongside cost, token, and execution-time metadata.

AA Coding Agents is separate from BenchLM model-only rankings. Its rows combine an agent harness, a host model, execution settings, and provider routing, so BenchLM treats the index as external system evidence rather than a weighted base-model benchmark. Component benchmark availability can vary by row in the source payload.
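
As a rough illustration of how an "average pass@1 index" might behave when component availability varies by row, the sketch below takes an unweighted mean over whichever component scores a row has. This is an assumption: the page does not say how the source handles missing components or whether components are weighted. The function name `index_score` and all input values are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean
from typing import Optional

# The three component benchmarks named on this page.
COMPONENTS = ("DeepSWE", "Terminal-Bench v2", "SWE-Atlas-QnA")


def index_score(row: dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the available component pass@1 scores (percent).

    ASSUMPTION: missing components are skipped rather than counted as zero.
    The source leaderboard may use a different rule.
    """
    available = [row[name] for name in COMPONENTS if row.get(name) is not None]
    if not available:
        return None  # no component results, so no index for this row
    return round(mean(available), 1)


# Hypothetical rows, invented for illustration only.
full_row = {"DeepSWE": 60.0, "Terminal-Bench v2": 70.0, "SWE-Atlas-QnA": 80.0}
partial_row = {"DeepSWE": 60.0, "Terminal-Bench v2": 70.0, "SWE-Atlas-QnA": None}

print(index_score(full_row))     # mean over three components
print(index_score(partial_row))  # mean over the two available components
```

Under this assumed rule, a row with a missing component is averaged over fewer benchmarks, which is one reason index scores for rows with different component coverage may not be directly comparable.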

12 indexed rows · 22 source models · 3 component benchmarks · v1.1 · Cost/time/token metadata · Display only

Index score on AA Coding Agents — June 2026 page snapshot

BenchLM mirrors the published index score view for AA Coding Agents. Codex - GPT-5.4 (medium) leads the public snapshot at 71.1%, followed by Claude Code - Opus 4.6 (medium), which shows the same 71.1% at the displayed precision, and Cursor CLI - GPT-5.4 (medium) at 68.8%. BenchLM does not use these results to rank models overall.

12 models · Coding · Current · Display only · Updated June 2026 page snapshot

The published AA Coding Agents snapshot is tightly clustered at the top: the first two rows both show 71.1%, and the third row (68.8%) is only 2.3 points behind. The top-10 spread is 17.0 points, from 71.1% down to 54.1%, so the index still separates agent systems even when the leaders cluster.
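
The gap and spread figures above can be reproduced from the index score table on this page. The following is a minimal sketch: the scores are copied from that table in rank order, while the helper name `gap` is an illustrative choice and not part of BenchLM or Artificial Analysis tooling.

```python
# Index scores (percent) copied from the table on this page, in rank order.
scores = [71.1, 71.1, 68.8, 66.6, 65.0, 64.4, 60.2, 56.9, 56.8, 54.1, 51.9, 51.8]


def gap(scores: list[float], lower_rank: int, upper_rank: int = 1) -> float:
    """Point gap between two 1-indexed ranks, rounded to one decimal place."""
    return round(scores[upper_rank - 1] - scores[lower_rank - 1], 1)


top3_gap = gap(scores, 3)       # leader vs. third row
top10_spread = gap(scores, 10)  # leader vs. tenth row

print(top3_gap, top10_spread)
```

Rounding to one decimal matches the precision of the published scores and avoids floating-point noise in the subtraction.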

The AA Coding Agents snapshot covers 12 entries and falls in BenchLM.ai's Coding category, which carries a 20% weight in the overall scoring system. The benchmark itself is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About AA Coding Agents

Year: 2026

Tasks: Composite over DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA

Format: Average pass@1 index

Difficulty: Real-world coding-agent workflows

BenchLM mirrors the Artificial Analysis Coding Agent Index v1.1 page as a display-only external leaderboard. The source combines DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA component scores and publishes cost, token, and execution-time metadata. Rows are coding-agent systems rather than pure base-model results.

BenchLM freshness & provenance

Version: AA Coding Agents 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Index score table (12 models)

Rank  Agent system                             Index score
1     Codex - GPT-5.4 (medium)                 71.1%
2     Claude Code - Opus 4.6 (medium)          71.1%
3     Cursor CLI - GPT-5.4 (medium)            68.8%
4     Cursor CLI - Composer 2                  66.6%
5     Claude Code - Opus 4.7 (max)             65.0%
6     Opencode - Opus 4.7 (medium)             64.4%
7     Cursor CLI - Opus 4.7 (medium)           60.2%
8     Gemini CLI - Gemini 3.1 Pro (high)       56.9%
9     Claude Code - Opus 4.7 (medium)          56.8%
10    Claude Code - Sonnet 4.6 (medium)        54.1%
11    Claude Code - Qwen3.7 Plus (thinking)    51.9%
12    Cursor CLI - Composer 2.5                51.8%

FAQ

What does AA Coding Agents measure?

AA Coding Agents is a display-only Artificial Analysis leaderboard for coding-agent systems. Each row combines an agent harness, a host model, and execution settings, and is scored by an average pass@1 index across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

Which model leads the published AA Coding Agents snapshot?

Codex - GPT-5.4 (medium) currently leads the published AA Coding Agents snapshot with a 71.1% index score. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on AA Coding Agents?

BenchLM's mirrored AA Coding Agents snapshot includes 12 entries, based on the public leaderboard as captured in the June 2026 page snapshot.

Last updated: June 2026 page snapshot · mirrored from the public benchmark leaderboard
