Model profile

Claude Opus 4.8

Name: Claude Opus 4.8
Author: Anthropic

Anthropic · Current · Released May 28, 2026

Data verified July 12, 2026

Overall Score

85 · Provisional #5 of 79 · Verified #3 of 35

Arena Elo

1477

Categories Ranked

5 of 8

Price (1M tokens)

$5 in / $25 out

Speed

Not listed

Context

1M tokens

Evidence coverage

53 of 296 tracked benchmarks are published. 31 are verified and 22 provisional. 8 of 8 categories are measured.

Updated July 12, 2026

Published / tracked: 53 / 296
Verified: 31
Provisional: 22
Categories measured: 8 / 8

Agentic: 19 benchmarks, mixed evidence
Coding: 12 benchmarks, mixed evidence
Reasoning: 2 benchmarks, reported
Knowledge: 10 benchmarks, mixed evidence
Math: 3 benchmarks, verified
Multilingual: 1 benchmark, verified
Multimodal: 5 benchmarks, mixed evidence
Inst. Following: 1 benchmark, reported
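The coverage figures above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from this page; the function and variable names are illustrative and not part of any BenchLM API.

```python
# Figures copied from the evidence coverage summary on this page.
PUBLISHED, TRACKED = 53, 296
VERIFIED, PROVISIONAL = 31, 22

# Benchmark counts per category, as listed on this page.
PER_CATEGORY = {
    "Agentic": 19, "Coding": 12, "Reasoning": 2, "Knowledge": 10,
    "Math": 3, "Multilingual": 1, "Multimodal": 5, "Inst. Following": 1,
}

def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)

# The verified/provisional split and the per-category counts
# should both add up to the published total.
counts_consistent = (
    VERIFIED + PROVISIONAL == PUBLISHED
    and sum(PER_CATEGORY.values()) == PUBLISHED
)

print(counts_consistent)            # True
print(share(PUBLISHED, TRACKED))    # 17.9 (percent of tracked benchmarks published)
print(share(VERIFIED, PUBLISHED))   # 58.5 (percent of published rows that are verified)
```

In other words, fewer than one in five tracked benchmarks has a published row for this model, which is worth keeping in mind when reading the category ranks.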

Proprietary · Reasoning

Confidence: High

base

According to BenchLM.ai, Claude Opus 4.8 ranks #5 of 79 models on the provisional leaderboard with an overall score of 85/100, and #3 of 35 on the verified leaderboard. That puts it in roughly the top 7% of provisionally ranked models and the top 9% of verified ones.

Claude Opus 4.8 is a proprietary model with a 1M token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

BenchLM links it to Claude Opus 4.7 (Adaptive) as the earlier related model in the same lineage. This profile currently covers 53 of the 296 benchmarks BenchLM tracks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

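Its strongest ranked category is Mathematics (#1 of 11 eligible models, category score 73.8), while its weakest is Multimodal & Grounded (#35 of 110, category score 66.2). This profile favors mathematical reasoning, scientific computing, and quantitative analysis, though the Mathematics rank rests on a small cohort and three benchmarks.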

Peer position

Exact provisional scores and ranks for the closest listed peers.

Range 83.0–87.0

Model	Author	Rank	Score
Gemini 3 Pro Deep Think	Google	#4	87.0
Claude Opus 4.8 (current model)	Anthropic	#5	85.0
GPT-5.4	OpenAI	#6	85.0
Claude Opus 4.6	Anthropic	#7	83.0
GPT-5.6 Sol	OpenAI	Unranked	86.0
Claude Sonnet 5	Anthropic	Unranked	85.0
GPT-5.6 Terra	OpenAI	Unranked	85.0

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Category	Percentile	Eligible cohort rank	Category score
Math	100%	#1 of 11	73.8
Coding	98%	#3 of 101	89.9
Agentic	96%	#6 of 120	93.6
Knowledge	94%	#7 of 107	88.2
Multimodal	69%	#35 of 110	66.2
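The page does not say how the percentile is derived from the cohort rank. As a sketch, the formula (N - rank) / (N - 1) × 100, rounded to a whole number, reproduces all five listed values. Treat it as an inference from the figures above, not as BenchLM's documented method; `rank_percentile` is a hypothetical helper name.

```python
def rank_percentile(rank: int, cohort_size: int) -> int:
    """Percentile position of a rank within its cohort, where 100 is best.

    Assumed formula: (N - rank) / (N - 1) * 100, rounded to an integer.
    """
    if cohort_size < 2:
        return 100  # a single-model cohort is trivially at the top
    return round(100 * (cohort_size - rank) / (cohort_size - 1))

# (rank, cohort size, percentile shown on the page)
LISTED = {
    "Math": (1, 11, 100),
    "Coding": (3, 101, 98),
    "Agentic": (6, 120, 96),
    "Knowledge": (7, 107, 94),
    "Multimodal": (35, 110, 69),
}

for name, (rank, cohort, shown) in LISTED.items():
    computed = rank_percentile(rank, cohort)
    print(f"{name}: computed {computed}, listed {shown}")
```

Note how cohort size matters: rank #1 of 11 and rank #3 of 101 land within two percentile points of each other, even though the second cohort is roughly nine times larger.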

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	93.6	#6 of 120	96th	22%	19 benchmarks	Mixed sources
Coding	89.9	#3 of 101	98th	20%	12 benchmarks	Mixed sources
Reasoning	0.0	Not ranked	Not available	17%	2 benchmarks	Reported
Knowledge	88.2	#7 of 107	94th	12%	10 benchmarks	Mixed sources
Math	73.8	#1 of 11	100th	5%	3 benchmarks	Verified
Multilingual	0.0	Not ranked	Not available	7%	1 benchmark	Verified
Multimodal	66.2	#35 of 110	69th	12%	5 benchmarks	Mixed sources
Inst. Following	0.0	Not ranked	Not available	5%	1 benchmark	Reported
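The eight category weights sum to 100%, but only five categories carry a score and rank. The page does not state how BenchLM combines them into the overall score. The sketch below shows one possible aggregation, assuming the unranked categories are dropped and the remaining weights are renormalized. That assumption is not confirmed by the source, and `renormalized_weighted_mean` is a hypothetical helper.

```python
# (category score, weight in percent) for the five ranked categories in the table.
RANKED = {
    "Agentic": (93.6, 22),
    "Coding": (89.9, 20),
    "Knowledge": (88.2, 12),
    "Math": (73.8, 5),
    "Multimodal": (66.2, 12),
}

def renormalized_weighted_mean(categories: dict) -> float:
    """Weighted mean of scores, with weights rescaled to the categories present."""
    total_weight = sum(weight for _, weight in categories.values())
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight

ranked_weight = sum(weight for _, weight in RANKED.values())
print(ranked_weight)                                  # 71 (percent of total weight)
print(round(renormalized_weighted_mean(RANKED), 1))   # 85.6
```

Under this assumption the result lands near the published overall score of 85 without matching it exactly, so the real method probably differs in some detail such as rounding or the treatment of unranked categories.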

Chatbot Arena performance

Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1477	±5.5	22,687
Coding	1537	±8.7	6,195
Math	1481	±17.8	1,078
Instruction Following	1481	±8.1	7,481
Creative Writing	1461	±10.3	4,007
Multi-turn	1496	±10.0	4,190
Hard Prompts	1510	±6.7	14,810
Hard Prompts (English)	1513	±8.1	7,585
Longer Query	1506	±7.7	10,212
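To make the table easier to read, the sketch below turns a rating and its ± value into an interval and converts a rating gap into an expected head-to-head score using the standard Elo logistic formula. Two assumptions are made here: that the ± column is a symmetric half-width, and that the leaderboard's ratings follow the classic 400-point Elo scale. The opponent rating of 1417 is purely hypothetical.

```python
def interval_bounds(elo: float, half_width: float) -> tuple:
    """Lower and upper bound of a rating, treating ± as a symmetric half-width."""
    return (elo - half_width, elo + half_width)

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Values from the table above: Text Overall and Math views.
text_overall = interval_bounds(1477, 5.5)
math_view = interval_bounds(1481, 17.8)

# The Math interval is more than three times wider, reflecting 1,078 votes
# versus 22,687 for Text Overall.
text_width = text_overall[1] - text_overall[0]
math_width = math_view[1] - math_view[0]

# Hypothetical opponent rated 60 points below the Text Overall rating.
HYPOTHETICAL_OPPONENT = 1417
expected = elo_expected_score(1477, HYPOTHETICAL_OPPONENT)

print(text_overall, math_view)
print(round(expected, 2))
```

A 60-point gap corresponds to an expected score only modestly above one half, so small Elo differences between neighboring models should not be over-read, especially in views with few votes.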

Benchmark Details

Rows below have a displayable published verification record. Each source link and provenance note remains in the page HTML while its category is closed. Source-unverified manual rows and generated rows stay hidden.

Agentic (19 benchmarks)

Terminal-Bench 2.0 · Provider exact
74.6% · Weighted 38%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.3 reports Terminal-Bench 2.1 at 74.6% mean reward using the Terminus-2 harness.

OSWorld-Verified · Provider exact
83.4% · Weighted 34%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.12.6 reports OSWorld-Verified at 83.4% first-attempt success rate.

BrowseComp · Provider exact
84.3% · Weighted 28%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.10.2 reports single-agent BrowseComp at 84.3%.

DeepSearchQA · Provider exact
93.1% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.10.3 reports DeepSearchQA F1 at 93.1%.

Finance Agent v2 · Provider exact
53.9% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.13.2 reports Finance Agent Benchmark v2 at 53.92%; BenchLM rounds display to one decimal.

GDPval-AA · Provider exact
1600 · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.13.6 reports GDPval-AA at 1890 Elo.

MCP Atlas · Provider exact
82.2% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.13.4 reports MCP-Atlas pass rate at 82.2%.

Toolathlon · Provider exact
59.9% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.13.7 reports Toolathlon Pass@1 at 59.9%.

Gert Labs · Benchmark exact
Gert Labs Composite Game Benchmark
72.97% · Display only
Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

AA Agentic Index · Reported
Artificial Analysis Agentic Index
47.2% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Tau2-Telecom · Reported
94.4% · Display only
Source: Artificial Analysis: tau2-bench leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

GDPval-AA · Reported
GDPval-AA normalized
55.0% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

ResearchClawBench · Benchmark exact
21.1% · Display only
Source: ResearchClawBench leaderboard
Provenance: ResearchClawBench reports this model as ResearchHarness (Claude-Opus-4.8) in the official Pass@1 leaderboard. BenchLM stores the one-decimal RADS average on the local ResearchClawBench display key and excludes it from weighted rankings.

OSWorld 2.0 · Benchmark exact
20.6% · Display only
Source: OSWorld 2.0 paper
Provenance: OSWorld 2.0 reports Claude Opus 4.8 batched actions on its 500-step main table. BenchLM stores the binary completion score and notes the corresponding partial score was 54.8%.

AA Briefcase · Reported
Artificial Analysis Briefcase
1354 · Display only
Source: Artificial Analysis: aa-briefcase leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA AutomationBench · Reported
Artificial Analysis AutomationBench
48.5% · Display only
Source: Artificial Analysis: automationbench-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA EnterpriseOps-Gym · Reported
Artificial Analysis EnterpriseOps-Gym
44.0% · Display only
Source: Artificial Analysis: enterprise-ops-gym-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA Harvey LAB · Reported
Artificial Analysis Harvey LAB-AA
7.5% · Display only
Source: Artificial Analysis: harvey-lab-aa leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA Tau3 Banking · Reported
Artificial Analysis Tau3-Banking
27.6% · Display only
Source: Artificial Analysis: tau3-banking leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Coding (12 benchmarks)

SWE-bench Pro · Provider exact
69.2% · Weighted 27%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.2 reports SWE-bench Pro at 69.2%.

SWE-bench Verified · Provider exact
Software Engineering Benchmark Verified
88.6% · Weighted 16%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.2 reports SWE-bench Verified at 88.6%.

SWE Multilingual · Provider exact
84.4% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.2 reports SWE-bench Multilingual at 84.4%.

SWE Multimodal · Provider exact
SWE-bench Multimodal
38.4% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.2 reports SWE-bench Multimodal at 38.4%.

Terminal-Bench 2.0 · Provider exact
74.6% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.3 reports Terminal-Bench 2.1 at 74.6% mean reward using the Terminus-2 harness.

CursorBench 3.1 · Benchmark exact
58.4% · Display only
Source: Cursor evals: CursorBench 3.1
Provenance: Cursor reports Opus 4.8 High at this exact CursorBench 3.1 score on its public evals page. BenchLM stores it on the Claude Opus 4.8 row as a display-only coding-agent benchmark.

CursorBench 3.2 · Benchmark exact
62.3% · Display only
Source: Cursor evals: CursorBench 3.2
Provenance: Cursor reports Opus 4.8 Max at this exact CursorBench 3.2 score on its public evals page. BenchLM stores it on the Claude Opus 4.8 row as a display-only coding-agent benchmark.

AA Coding Index · Reported
Artificial Analysis Coding Index
74.3% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Terminal-Bench Hard · Reported
58.3% · Display only
Source: Artificial Analysis: terminalbench-hard leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-SciCode · Reported
Artificial Analysis SciCode
53.5% · Display only
Source: Artificial Analysis: scicode leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

FrontierCode · Benchmark exact
FrontierCode Diamond
46.5% · Display only
Source: Cognition: FrontierCode 1.1
Provenance: Cognition reports Claude Opus 4.8 at 46.5% on FrontierCode 1.1 Main, using the best max effort row from the published data JSON.

AA Terminal-Bench 2.1 · Reported
Artificial Analysis Terminal-Bench v2.1
84.6% · Display only
Source: Artificial Analysis: terminalbench-v2-1 leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Reasoning (2 benchmarks)

AA-LCR · Reported
Artificial Analysis Long Context Reasoning
67.7% · Display only
Source: Artificial Analysis: artificial-analysis-long-context-reasoning leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

CritPt · Reported
Critical Physics Tasks
20.9% · Display only
Source: Artificial Analysis: critpt leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

Knowledge (10 benchmarks)

HLE · Provider exact
Humanity's Last Exam
57.9% · Weighted 32%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.10.1 reports Humanity's Last Exam with tools at 57.9%.

GPQA · Provider exact
Graduate-Level Google-Proof Q&A
93.6% · Weighted 5%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.6 reports GPQA Diamond at 93.6%; BenchLM stores that exact value on both gpqaDiamond and the weighted gpqa lane.

GPQA-D · Provider exact
GPQA Diamond
93.6% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.6 reports GPQA Diamond at 93.6%.

HLE w/o tools · Provider exact
Humanity's Last Exam without tools
49.8% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.10.1 reports Humanity's Last Exam without tools at 49.8%.

Artificial Analysis Intelligence Index · Reported
55.7% · Display only
Source: Artificial Analysis: artificial-analysis-intelligence-index leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-GPQA Diamond · Reported
Artificial Analysis GPQA Diamond
92.0% · Display only
Source: Artificial Analysis: gpqa-diamond leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-HLE · Reported
Artificial Analysis Humanity's Last Exam
45.7% · Display only
Source: Artificial Analysis: humanitys-last-exam leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-Omniscience Index · Reported
Artificial Analysis Omniscience Index
27.4% · Display only
Source: Artificial Analysis: omniscience leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

AA-Omniscience Accuracy · Reported
Artificial Analysis Omniscience Accuracy
46.6% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Hallucination Rate · Reported
Artificial Analysis Omniscience Hallucination Rate
35.9% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Math (3 benchmarks)

FrontierMath v2 (Tiers 1-3) · Benchmark exact
FrontierMath v2 Tiers 1-3
47.241% · Weighted 30%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tiers 1-3 at 47.241% for claude-opus-4-8_max. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

USAMO 2026 · Provider exact
United States of America Mathematical Olympiad 2026
96.7% · Weighted 10%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.7 reports USAMO 2026 at 96.7%.

FrontierMath v2 (Tier 4) · Benchmark exact
FrontierMath v2 Tier 4
31.250% · Weighted 10%
Source: Epoch AI FrontierMath v2 leaderboard
Provenance: Epoch AI reports FrontierMath v2 Tier 4 at 31.25% for claude-opus-4-8_max. BenchLM selects the highest published thinking effort for the model and stores the v2 benchmark slice separately.

Multilingual (1 benchmark)

INCLUDE · Provider exact
87.6% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.15.3 reports INCLUDE average accuracy at 87.6%.

Multimodal (5 benchmarks)

OfficeQA Pro · Provider exact
66.2% · Weighted 30%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.13.1 reports OfficeQA Pro at 66.2% exact-match accuracy under Anthropic's internal agentic harness.

CharXiv · Provider exact
CharXiv Reasoning
89.9% · Weighted 25%
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.12.4 reports CharXiv Reasoning with Python tools at 89.9%.

ScreenSpot Pro · Provider exact
87.9% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.12.5 reports ScreenSpot-Pro with Python tools at 87.9%.

CharXiv w/o tools · Provider exact
CharXiv Reasoning without tools
80.5% · Display only
Source: Anthropic: Claude Opus 4.8 system card
Provenance: Section 8.12.4 reports CharXiv Reasoning without tools at 80.5%.

Design Arena Website · Reported
Design Arena Website Elo
1281 · Display only
Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Inst. Following (1 benchmark)

AA-IFBench · Reported
Artificial Analysis IFBench
62.2% · Display only
Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Claude Opus 4.8 Family

Base entry

Related Earlier Model

Claude Opus 4.7 (Adaptive)

Frequently Asked Questions

How does Claude Opus 4.8 perform overall in AI benchmarks?

Claude Opus 4.8 currently ranks #5 of 79 models on BenchLM's provisional leaderboard with an overall score of 85, and #3 of 35 on the verified leaderboard. It was created by Anthropic and has a 1M-token context window.

Is Claude Opus 4.8 good for knowledge and understanding?

Claude Opus 4.8 ranks #7 of 107 eligible models in knowledge and understanding, with a category score of 88.2. It is among the top performers in this category.

Is Claude Opus 4.8 good for coding and programming?

Claude Opus 4.8 ranks #3 of 101 eligible models in coding and programming, with a category score of 89.9. It is among the top performers in this category.

Is Claude Opus 4.8 good for mathematics?

Claude Opus 4.8 ranks #1 of 11 eligible models in mathematics, with a category score of 73.8. It leads this category, although the eligible cohort is small.

Is Claude Opus 4.8 good for reasoning and logic?

Claude Opus 4.8 has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Claude Opus 4.8 good for agentic tool use and computer tasks?

Claude Opus 4.8 ranks #6 of 120 eligible models in agentic tool use and computer tasks, with a category score of 93.6. It is among the top performers in this category.

Is Claude Opus 4.8 good for multimodal and grounded tasks?

Claude Opus 4.8 ranks #35 of 110 eligible models in multimodal and grounded tasks, with a category score of 66.2. There are stronger options in this category.

Is Claude Opus 4.8 good for instruction following?

Claude Opus 4.8 has visible benchmark coverage in instruction following, but BenchLM does not currently assign it a global category rank there.

Is Claude Opus 4.8 good for multilingual tasks?

Claude Opus 4.8 has visible benchmark coverage in multilingual tasks, but BenchLM does not currently assign it a global category rank there.

Does Claude Opus 4.8 have full benchmark coverage on BenchLM?

Not yet. Claude Opus 4.8 currently has 53 published benchmark scores out of the 296 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Claude Opus 4.8?

Claude Opus 4.8 has a 1M-token context window, which determines how much text it can process in a single interaction.

Last updated: July 12, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
