Benchmark profile

Artificial Analysis Coding Agent Index (AA Coding Agents)

A display-only Artificial Analysis leaderboard for coding-agent systems, combining agent harnesses, host models, and execution settings across software-engineering benchmarks.

How BenchLM shows AA Coding Agents

BenchLM mirrors the Artificial Analysis Coding Agent Index v1.1 page as a display-only agent leaderboard. The source compares coding-agent variants across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA, and reports the average pass@1 index alongside cost, token, and execution-time metadata.

AA Coding Agents is separate from BenchLM model-only rankings. Its rows combine an agent harness, a host model, execution settings, and provider routing, so BenchLM treats the index as external system evidence rather than a weighted base-model benchmark. Component benchmark availability can vary by row in the source payload.
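As a rough illustration of an average pass@1 index when component availability varies by row, the sketch below averages whichever component scores a row carries. Equal weighting and skipping missing components are assumptions made for illustration; the source page does not state how Artificial Analysis handles rows with missing components. The function name and the sample row are hypothetical.

```python
from statistics import mean
from typing import Optional

# Component benchmarks named on the source page.
COMPONENTS = ("DeepSWE", "Terminal-Bench v2", "SWE-Atlas-QnA")


def index_score(component_scores: dict[str, Optional[float]]) -> Optional[float]:
    """Average the pass@1 percentages available for one leaderboard row.

    Assumption: components are equally weighted and missing ones are skipped.
    Returns None when the row has no component scores at all.
    """
    available = [
        component_scores[name]
        for name in COMPONENTS
        if component_scores.get(name) is not None
    ]
    if not available:
        return None
    return round(mean(available), 1)


# Hypothetical row: values are illustrative, not taken from the leaderboard.
row = {"DeepSWE": 60.0, "Terminal-Bench v2": 70.0, "SWE-Atlas-QnA": None}
print(index_score(row))  # 65.0
```

Under this assumption, a row missing one component is averaged over the remaining two, so its index is not strictly comparable with rows that report all three.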

42 indexed rows · 52 source models · 3 component benchmarks · v1.1 · Cost/time/token metadata · Display only

Artificial Analysis coding agents · Artificial Analysis methodology · DeepSWE component · Terminal-Bench component

Index score on AA Coding Agents — June 2026 page snapshot

BenchLM mirrors the published index score view for AA Coding Agents. Claude Code - Opus 5 (max) leads the public snapshot at 65.5%, followed by Codex - GPT-5.6 Sol (xhigh) (65.1%) and Codex - GPT-5.6 Sol (high) (64.1%). BenchLM does not use these results to rank models overall.

Claude Code - Opus 5 (max) · Anthropic · 65.5% · Overall: —

Codex - GPT-5.6 Sol (xhigh) · OpenAI · 65.1% · Overall: —

Codex - GPT-5.6 Sol (high) · OpenAI · 64.1% · Overall: —

42 models · Coding · Current · Display only · Updated June 2026 page snapshot

Index score table (42 models)

Agent                                   Provider   Score
Claude Code - Opus 5 (max)              Anthropic  65.5%
Codex - GPT-5.6 Sol (xhigh)             OpenAI     65.1%
Codex - GPT-5.6 Sol (high)              OpenAI     64.1%
Claude Code - Opus 5 (high)             Anthropic  63.4%
Codex - GPT-5.6 Terra (max)             OpenAI     62.3%
Claude Code - Opus 5 (medium)           Anthropic  61.9%
Codex - GPT-5.5 (xhigh)                 OpenAI     61.5%
Codex - GPT-5.6 Sol (medium)            OpenAI     60.6%
Claude Code - Opus 4.8 (max)            Anthropic  60.5%
Codex - GPT-5.6 Luna (max)              OpenAI     58.7%
Claude Code - Opus 4.8 (xhigh)          Anthropic  58.5%
Codex - GPT-5.6 Terra (xhigh)           OpenAI     57.1%
Claude Code - Opus 5 (low)              Anthropic  56.8%
Claude Code - Opus 4.8 (high)           Anthropic  56.7%
Codex - GPT-5.6 Terra (high)            OpenAI     55.8%
Codex - GPT-5.6 Luna (xhigh)            OpenAI     54.7%
Codex - GPT-5.5 (medium)                OpenAI     54.4%
Codex - GPT-5.6 Sol (low)               OpenAI     53.6%
Claude Code - Opus 4.8 (medium)         Anthropic  53.6%
Codex - GPT-5.6 Luna (high)             OpenAI     51.4%
Claude Code - Opus 4.7 (max)            Anthropic  50.3%
Opencode - Opus 4.7 (medium)            Opencode   50.0%
Codex - GPT-5.6 Terra (medium)          OpenAI     47.8%
Claude Code - Opus 4.8 (low)            Anthropic  47.4%
Claude Code - Opus 4.6 (medium)         Anthropic  46.5%
Cursor CLI - GPT-5.5 (medium)           Cursor     46.1%
Cursor CLI - Opus 4.7 (medium)          Cursor     45.4%
Codex - GPT-5.6 Sol (none)              OpenAI     43.4%
Codex - GPT-5.6 Luna (medium)           OpenAI     42.4%
Claude Code - Opus 4.7 (medium)         Anthropic  40.5%
Codex - GPT-5.4 (medium)                OpenAI     39.1%
Cursor CLI - Composer 2.5               Cursor     38.2%
Claude Code - Sonnet 4.6 (medium)       Anthropic  37.6%
Cursor CLI - GPT-5.4 (medium)           Cursor     36.8%
Codex - GPT-5.6 Terra (low)             OpenAI     36.7%
Claude Code - GLM-5.1                   Anthropic  36.1%
Claude Code - Qwen3.7 Plus (thinking)   Anthropic  36.0%
Claude Code - Kimi K2.6                 Anthropic  32.6%
Cursor CLI - Composer 2                 Cursor     27.5%
Codex - GPT-5.6 Luna (low)              OpenAI     25.1%
Codex - GPT-5.6 Terra (none)            OpenAI     23.7%
Codex - GPT-5.6 Luna (none)             OpenAI     20.4%

The published AA Coding Agents snapshot places Claude Code - Opus 5 (max) first at 65.5%. The third row, at 64.1%, is 1.4 points behind. The top 10 rows span 6.8 points (65.5% down to 58.7%), so the leading results sit in a relatively narrow band.
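The gaps quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the ten highest index scores from the table; the variable and function names are illustrative and not part of any BenchLM or Artificial Analysis API.

```python
# Ten highest index scores from the mirrored snapshot, in percent.
top10 = [65.5, 65.1, 64.1, 63.4, 62.3, 61.9, 61.5, 60.6, 60.5, 58.7]


def gap(higher: float, lower: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


third_gap = gap(top10[0], top10[2])         # leader vs. third row
top10_range = gap(max(top10), min(top10))   # spread across the top 10

print(third_gap)    # 1.4
print(top10_range)  # 6.8
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the differences.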

42 models have been evaluated on AA Coding Agents, which falls in the Coding category. That category carries a 20% weight in BenchLM.ai's overall scoring system, but AA Coding Agents itself is displayed for reference only and excluded from the scoring formula, so it does not directly affect overall rankings.

About AA Coding Agents

Year: 2026

Tasks: Composite over DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA

Format: Average pass@1 index

Difficulty: Real-world coding-agent workflows

BenchLM mirrors the Artificial Analysis Coding Agent Index v1.1 page as a display-only external leaderboard. The source combines DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA component scores and publishes cost, token, and execution-time metadata. Rows are coding-agent systems rather than pure base-model results.

Artificial Analysis Coding Agent Benchmarks · Public benchmark source

BenchLM freshness & provenance

Version: AA Coding Agents 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does AA Coding Agents measure?

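AA Coding Agents measures how coding-agent systems (an agent harness combined with a host model and execution settings) perform across software-engineering benchmarks. The index is the average pass@1 across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA, and BenchLM shows it as a display-only Artificial Analysis leaderboard.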

Which model leads the published AA Coding Agents snapshot?

Claude Code - Opus 5 (max) currently leads the published AA Coding Agents snapshot with an index score of 65.5%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on AA Coding Agents?

42 AI models are included in BenchLM's mirrored AA Coding Agents snapshot, based on the public leaderboard as captured in the June 2026 page snapshot.

Last updated: June 2026 page snapshot · mirrored from the public benchmark leaderboard
