Model profile

Claude Opus 4.6

Name: Claude Opus 4.6
Author: Anthropic

AnthropicSupersededReleased February 2026

Data verified July 20, 2026

Superseded:Anthropic has released newer models in this line —Claude Opus 4.7 (Adaptive)·Claude Opus 4.7

Overall Score

68.59Public #16 of 200Verified #14 of 99

Arena Elo

1498

Eligible category ranks

3of 8

Price (1M tokens)

$5 in / $25 out

API pricing

Speed

40tok/s

Context

Evidence coverage

46 of 321 tracked benchmarks are published. 18 are verified and 28 provisional. 7 of 8 categories are measured.

Updated July 20, 2026Methodology

Published / tracked: 46 / 321
Verified: 18
Provisional: 28
Categories with evidence: 7 / 8

Agentic10 benchmarks
Mixed evidence
Coding9 benchmarks
Mixed evidence
Reasoning2 benchmarks
Reported
Knowledge15 benchmarks
Mixed evidence
Math3 benchmarks
Mixed evidence
Multilingual0 benchmarks
Not measured
Multimodal6 benchmarks
Mixed evidence
Inst. Following1 benchmark
Reported

ProprietaryNon-Reasoning

Confidence:

High

base

Claude Opus 4.6 ranks #16 out of 200 models on the public leaderboard with an overall score of 68.59/100. It also ranks #14 out of 99 on the verified leaderboard. This places it in the mid-tier of AI models, with strengths in specific benchmark categories.

Claude Opus 4.6 is a proprietary model with a 1M token context window. It processes queries without explicit chain-of-thought reasoning, offering faster response times and lower token usage.

Claude Opus 4.6 sits inside the Claude Opus 4.6 family alongside Claude Opus 4.6 (Adaptive). This profile currently has 46 of 321 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

Its strongest category is Knowledge (#7), while its weakest is Agentic (#21). This performance profile makes it particularly effective for knowledge-intensive tasks like research, analysis, and factual Q&A.

Peer position

Exact provisional scores and ranks for the closest listed peers. A score can appear before a model clears the evidence threshold for a rank, so equal scores can have different rank states.

Range 67.22–69.75

MiniMax M3
MiniMax
#1569.75
MiniMax M3 is #15 with a score of 69.75.
Compare
Claude Opus 4.6Current model
Anthropic
#1668.59
Claude Opus 4.6 is #16 with a score of 68.59.
MiMo-V2-Pro
Xiaomi
#1767.78
MiMo-V2-Pro is #17 with a score of 67.78.
Compare
GLM-5.1
Z.AI
#1867.74
GLM-5.1 is #18 with a score of 67.74.
Compare
Gemini 3 Pro
Google
#1967.73
Gemini 3 Pro is #19 with a score of 67.73.
Compare
Inkling
Thinking Machines Lab
#2067.54
Inkling is #20 with a score of 67.54.
Compare
Qwen3.7 Plus
Alibaba
#2167.22
Qwen3.7 Plus is #21 with a score of 67.22.
Compare

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Knowledge88%
Eligible cohort rank #7 of 52Category score 85.8
Coding86%
Eligible cohort rank #18 of 122Category score 62.5
Agentic83%
Eligible cohort rank #21 of 119Category score 55.2

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
AgenticRank #21 of 119Percentile 83rdWeight 22%10 benchmarksMixed sources	55.2	#21 of 119	83rd	22%	10 benchmarks	Mixed sources
CodingRank #18 of 122Percentile 86thWeight 20%9 benchmarksMixed sources	62.5	#18 of 122	86th	20%	9 benchmarks	Mixed sources
ReasoningRank Not rankedWeight 17%2 benchmarksReported	79.8	Not ranked	Not available	17%	2 benchmarks	Reported
KnowledgeRank #7 of 52Percentile 88thWeight 12%15 benchmarksMixed sources	85.8	#7 of 52	88th	12%	15 benchmarks	Mixed sources
MathRank Not rankedWeight 5%3 benchmarksMixed sources	59.8	Not ranked	Not available	5%	3 benchmarks	Mixed sources
MultilingualWeight 7%0 benchmarksNot measured	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
MultimodalRank Not rankedWeight 12%6 benchmarksMixed sources	61.1	Not ranked	Not available	12%	6 benchmarks	Mixed sources
Inst. FollowingRank Not rankedWeight 5%1 benchmarkReported	84.5	Not ranked	Not available	5%	1 benchmark	Reported

Chatbot Arena performance

Scroll horizontally to inspect confidence intervals and vote counts.

Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1498	±3.7	64,747
Coding	1549	±5.9	17,906
Math	1503	±10.5	3,582
Instruction Following	1499	±5.5	20,851
Creative Writing	1478	±7.0	10,553
Multi-turn	1511	±6.8	11,402
Hard Prompts	1527	±4.5	41,406
Hard Prompts (English)	1532	±5.6	20,436
Longer Query	1517	±5.4	26,441

Benchmark Details

Rows below have a displayable published verification record. Each source link and provenance note remains in the page HTML while its category is closed. Source-unverified manual rows and generated rows stay hidden.

Agentic10 benchmarks

Terminal-Bench 2.0Provider exact

65.4%Weighted 38%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Provider exact

OSWorld-VerifiedProvider exact

72.7%Weighted 34%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Anthropic evaluation table reports Claude Opus 4.6 at 72.7% on OSWorld-Verified.

BrowseCompProvider exact

83.7%Weighted 28%

Source: Anthropic: Claude Mythos Preview System CardProvenance: Section 6.10.2: "Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%)."

τ²-bench resultsReported

τ²-Bench Tool-Agent-User Evaluation

84.8%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Claw-EvalBenchmark exact

70.4%Display only

Source: Claw-Eval leaderboardProvenance: Claw-Eval reports this model as opus46 in the official 2026-05-09 leaderboard snapshot. BenchLM stores the primary Pass^3 value on the local Claw-Eval display key.

DeepSearchQASecondary exact

73.7%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

CyberGymBenchmark exact

66.6%Display only

Source: CyberGym leaderboardProvenance: CyberGym Level 1 reports Anthropic Agent with Claude Opus 4.6 on the public leaderboard. BenchLM stores the reported target-vulnerability reproduction success rate on the local cyberGym key.

Gert LabsBenchmark exact

Gert Labs Composite Game Benchmark

61.85%Display only

Source: Gert Labs rankingsProvenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

ResearchClawBenchBenchmark exact

19.9%Display only

Source: ResearchClawBench leaderboardProvenance: ResearchClawBench reports this model as ResearchHarness (Claude-Opus-4.6) in the official Pass@1 leaderboard. BenchLM stores the one-decimal RADS average on the local ResearchClawBench display key and excludes it from weighted rankings.

JobBenchBenchmark exact

36.7%Display only

Source: JobBench paperProvenance: JobBench reports Opus-4.6 under Claude Code on the main-set leaderboard. BenchLM stores the reported main-set score.

Coding9 benchmarks

SWE-RebenchBenchmark exact

65.3%Weighted 20%

Source: SWE-Rebench leaderboardProvenance: Public SWE-Rebench leaderboard lists Claude Opus 4.6 at 65.3% resolved rate.

SWE-bench VerifiedProvider exact

Software Engineering Benchmark Verified

80.8%Weighted 16%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Provider exact

SWE-bench ProSecondary exact

53.4%Weighted 10%

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

SWE-bench Verified*Secondary exact

SWE-bench Verified (mini-swe-agent-v2)

75.6%Display only

Source: Arcee Trinity-Large-Thinking comparison tableProvenance: Secondary exact

LiveCodeBench ProSecondary exact

70.7%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

React Native EvalsBenchmark exact

84.1%Display only

Source: React Native Evals leaderboardProvenance: React Native Evals reports this exact overall score for Claude Opus 4.6 in the public dashboard run finished on 2026-04-28.

Vibe Code BenchBenchmark exact

Vibe Code Bench v1.1

57.57%Display only

Source: Vals AI: Vibe Code Bench v1.1Provenance: Vals Vibe Code Bench v1.1 reports this exact row under anthropic/claude-opus-4-6; BenchLM stores it on the local vibeCodeBench key.

AA-SciCodeReported

Artificial Analysis SciCode

45.7%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

FrontierCode 1.1 MainBenchmark exact

26.9%Display only

Source: Cognition: FrontierCode 1.1Provenance: Cognition reports Claude Opus 4.6 at 26.9% on FrontierCode 1.1 Main, using the best max effort row from the published data JSON.

Reasoning2 benchmarks

AA-LCRReported

Artificial Analysis Long Context Reasoning

58.3%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

CritPtReported

Critical Physics Tasks

2.8%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Knowledge15 benchmarks

HLEProvider exact

Humanity's Last Exam

53%Weighted 45%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Anthropic Table 2.21.B reports HLE with tools at 53.0% for Claude Opus 4.6.

MMLU-ProSecondary exact

Massive Multitask Language Understanding Professional

82%Weighted 30%

Source: Qwen3.6-Plus comparison tableProvenance: Secondary exact

GPQAProvider exact

Graduate-Level Google-Proof Q&A

91.3%Weighted 7%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Provider exact

SuperGPQASecondary exact

SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines

95%Weighted 7%

Source: Qwen3.6-Plus comparison tableProvenance: Secondary exact

GPQA-DSecondary exact

GPQA Diamond

89.2%Display only

Source: Arcee Trinity-Large-Thinking comparison tableProvenance: Secondary exact

MMLU-Pro (Arcee)Secondary exact

MMLU-Pro first-party comparison snapshot

89.1%Display only

Source: Arcee Trinity-Large-Thinking comparison tableProvenance: Secondary exact

HLE w/o toolsSecondary exact

Humanity's Last Exam without tools

40%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

HealthBench HardSecondary exact

14.8%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

MedXpertQA (Text)Secondary exact

MedXpertQA Text

52.1%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

Artificial Analysis Intelligence IndexReported

37.8%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-GPQA DiamondReported

Artificial Analysis GPQA Diamond

84.0%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-HLEReported

Artificial Analysis Humanity's Last Exam

18.6%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience IndexReported

Artificial Analysis Omniscience Index

3.5%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience AccuracyReported

Artificial Analysis Omniscience Accuracy

45.2%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Hallucination RateReported

Artificial Analysis Omniscience Hallucination Rate

76.0%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Math3 benchmarks

FrontierMath v2 (Tiers 1-3)Benchmark exact

FrontierMath v2 Tiers 1-3

40.700%Weighted 30%

Source: Epoch AI FrontierMath v2 leaderboardProvenance: Epoch AI reports FrontierMath v2 Tiers 1-3 at 40.7% for claude-opus-4-6_max. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

FrontierMath v2 (Tier 4)Benchmark exact

FrontierMath v2 Tier 4

22.900%Weighted 10%

Source: Epoch AI FrontierMath v2 leaderboardProvenance: Epoch AI reports FrontierMath v2 Tier 4 at 22.9% for claude-opus-4-6_max. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

AIME25 (Arcee)Secondary exact

AIME25 first-party comparison snapshot

99.8%Display only

Source: Arcee Trinity-Large-Thinking comparison tableProvenance: Secondary exact

Multimodal6 benchmarks

MMMU-ProProvider exact

Massive Multi-discipline Multimodal Understanding Pro

77.3%Weighted 45%

Source: Anthropic: Claude Opus 4.6 system cardProvenance: Anthropic Table 2.3.A reports MMMU-Pro (with tools) at 77.3% for Claude Opus 4.6, matching the stored BenchLM row.

ERQASecondary exact

51.6%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

ScreenSpot ProSecondary exact

83.1%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

MedXpertQA (MM)Secondary exact

MedXpertQA Multimodal

64.8%Display only

Source: Meta AI: Muse Spark comparison chartProvenance: Secondary exact

AA-MMMU-ProReported

Artificial Analysis MMMU-Pro

72.5%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Design Arena WebsiteReported

Design Arena Website Elo

1328Display only

Source: OpenRouter model benchmarksProvenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Inst. Following1 benchmark

AA-IFBenchReported

Artificial Analysis IFBench

44.6%Display only

Source: Artificial Analysis model benchmarksProvenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Claude Opus 4.6 Family

Base entry

Claude Opus 4.6 (Adaptive)Score 64.18

Frequently Asked Questions

How does Claude Opus 4.6 perform overall in AI benchmarks?

Claude Opus 4.6 currently ranks #16 out of 200 models on BenchLM's provisional leaderboard with an overall score of 68.59. It also ranks #14 out of 99 on the verified leaderboard. It is created by Anthropic. Its published context window is 1M.

Is Claude Opus 4.6 good for knowledge and understanding?

Claude Opus 4.6 ranks #7 out of 52 models in knowledge and understanding benchmarks with an average score of 85.8. It is among the top performers in this category.

Is Claude Opus 4.6 good for coding and programming?

Claude Opus 4.6 ranks #18 out of 122 models in coding and programming benchmarks with an average score of 62.5. There are stronger options in this category.

Is Claude Opus 4.6 good for mathematics?

Claude Opus 4.6 has visible benchmark coverage in mathematics, but BenchLM does not currently assign it a global category rank there.

Is Claude Opus 4.6 good for reasoning and logic?

Claude Opus 4.6 has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Claude Opus 4.6 good for agentic tool use and computer tasks?

Claude Opus 4.6 ranks #21 out of 119 models in agentic tool use and computer tasks benchmarks with an average score of 55.2. There are stronger options in this category.

Is Claude Opus 4.6 good for multimodal and grounded tasks?

Claude Opus 4.6 has visible benchmark coverage in multimodal and grounded tasks, but BenchLM does not currently assign it a global category rank there.

Is Claude Opus 4.6 good for instruction following?

Claude Opus 4.6 has visible benchmark coverage in instruction following, but BenchLM does not currently assign it a global category rank there.

Which sibling models are related to Claude Opus 4.6?

Claude Opus 4.6 belongs to the Claude Opus 4.6 family. Related variants on BenchLM include Claude Opus 4.6 (Adaptive).

Does Claude Opus 4.6 have full benchmark coverage on BenchLM?

Not yet. Claude Opus 4.6 currently has 46 published benchmark scores out of the 321 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Claude Opus 4.6?

Claude Opus 4.6 has a published context window of 1M, which determines how much text it can process in a single interaction.

Related Resources

Last updated: July 20, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.

Choose with this week’s evidence

Join 2,000+ readers for ranking moves, new releases, pricing changes, and the evidence behind them.

Free. One email per week.

Claude Opus 4.6

Evidence coverage

Evidence by category

Peer position

Category percentile

Category evidence

Chatbot Arena performance

Benchmark Details

Claude Opus 4.6 Family

Frequently Asked Questions

How does Claude Opus 4.6 perform overall in AI benchmarks?

Is Claude Opus 4.6 good for knowledge and understanding?

Is Claude Opus 4.6 good for coding and programming?

Is Claude Opus 4.6 good for mathematics?

Is Claude Opus 4.6 good for reasoning and logic?

Is Claude Opus 4.6 good for agentic tool use and computer tasks?

Is Claude Opus 4.6 good for multimodal and grounded tasks?

Is Claude Opus 4.6 good for instruction following?

Which sibling models are related to Claude Opus 4.6?

Does Claude Opus 4.6 have full benchmark coverage on BenchLM?

What is the context window size of Claude Opus 4.6?

Related Resources

Choose with this week’s evidence