Claude 4.1 Opus Thinking Benchmark Scores & Performance

Benchmark analysis of Claude 4.1 Opus Thinking by Anthropic across 14 tests.

Creator

Anthropic

Source Type

Proprietary

Reasoning

Reasoning

Context Window

200K

Overall Score

29 (#79 of 88)

Knowledge Benchmarks

MMLU: 38
GPQA: 37
SuperGPQA: 35
OpenBookQA: 33

Coding Benchmarks

HumanEval: 30

Mathematics Benchmarks

AIME 2023: 38
AIME 2024: 40
AIME 2025: 39
HMMT Feb 2023: 34
HMMT Feb 2024: 36
HMMT Feb 2025: 35
BRUMO 2025: 37

Reasoning Benchmarks

SimpleQA: 36
MuSR: 34

Frequently Asked Questions

How does Claude 4.1 Opus Thinking perform overall in AI benchmarks?

Claude 4.1 Opus Thinking ranks #79 out of 88 models with an overall score of 29. It was developed by Anthropic and has a 200K-token context window.

Is Claude 4.1 Opus Thinking good for knowledge and understanding?

Claude 4.1 Opus Thinking ranks #79 out of 88 models in knowledge and understanding benchmarks with an average score of 35.8. There are stronger options in this category.

Is Claude 4.1 Opus Thinking good for coding and programming?

Claude 4.1 Opus Thinking ranks #79 out of 88 models in coding and programming benchmarks with an average score of 30. There are stronger options in this category.

Is Claude 4.1 Opus Thinking good for mathematics?

Claude 4.1 Opus Thinking ranks #79 out of 88 models in mathematics benchmarks with an average score of 37. There are stronger options in this category.

Is Claude 4.1 Opus Thinking good for reasoning and logic?

Claude 4.1 Opus Thinking ranks #79 out of 88 models in reasoning and logic benchmarks with an average score of 35. There are stronger options in this category.
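The category averages quoted in the answers above can be reproduced from the per-benchmark scores listed on this page. The following is a minimal Python sketch, assuming each average is a simple unweighted mean rounded to one decimal place. The page does not state how the averages, the rankings, or the overall score of 29 are actually derived, so the function name and the rounding rule here are illustrative assumptions.

```python
from statistics import mean

# Scores copied from the benchmark listings on this page.
scores = {
    "knowledge": {"MMLU": 38, "GPQA": 37, "SuperGPQA": 35, "OpenBookQA": 33},
    "coding": {"HumanEval": 30},
    "mathematics": {
        "AIME 2023": 38, "AIME 2024": 40, "AIME 2025": 39,
        "HMMT Feb 2023": 34, "HMMT Feb 2024": 36, "HMMT Feb 2025": 35,
        "BRUMO 2025": 37,
    },
    "reasoning": {"SimpleQA": 36, "MuSR": 34},
}


def category_averages(scores_by_category):
    """Return the unweighted mean per category, rounded to one decimal.

    Assumption: a plain arithmetic mean; the page does not document its method.
    """
    return {
        category: round(mean(results.values()), 1)
        for category, results in scores_by_category.items()
    }


averages = category_averages(scores)
total_benchmarks = sum(len(results) for results in scores.values())

for category, average in averages.items():
    print(f"{category}: {average:g}")
print(f"benchmarks counted: {total_benchmarks}")
```

Under that assumption the sketch gives 35.8 for knowledge, 30 for coding, 37 for mathematics, and 35 for reasoning across 14 benchmarks, matching the figures quoted on this page. It does not reproduce the overall score, which appears to use a method not described here.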

What is the context window size of Claude 4.1 Opus Thinking?

Claude 4.1 Opus Thinking has a context window of 200K tokens, which determines how much text it can process in a single interaction.
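To make the context window limit concrete, here is a rough Python sketch that estimates whether a piece of text fits. It assumes that 200K means 200,000 tokens and uses about 4 characters per token, a coarse rule of thumb for English text rather than the model's real tokenizer. The helper names are hypothetical, and actual token counts can differ noticeably.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # "200K" from this page, assumed to mean 200,000
CHARS_PER_TOKEN = 4              # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count with ceiling division on the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated input plus reserved output tokens fit the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


sample = "word " * 1000  # 5,000 characters
print(estimate_tokens(sample), fits_in_context(sample))
```

For real usage, an exact count from the provider's own tokenizer or token-counting facility would be more reliable than this estimate.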