Mercury 2 Benchmark Scores & Performance

Benchmark analysis of Mercury 2 by Inception across 32 sourced tests on BenchLM.

According to BenchLM.ai, Mercury 2 ranks #43 out of 123 models with an overall score of 65/100. While not a frontier model, it offers specific advantages depending on the use case.

Mercury 2 is a proprietary model with a 128K token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.
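The latency and token-usage tradeoff can be made concrete with a back-of-the-envelope calculation. All numbers below (reasoning token counts, per-token price) are illustrative assumptions for the sketch, not Mercury 2's actual figures.

```python
# Illustrative only: how chain-of-thought inflates token usage and cost.
# The price and token counts are hypothetical, not Mercury 2 pricing.

PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # hypothetical price in USD


def response_cost(answer_tokens: int, reasoning_tokens: int = 0) -> float:
    """Cost of one response; reasoning tokens billed like output tokens."""
    total = answer_tokens + reasoning_tokens
    return total / 1000 * PRICE_PER_1K_OUTPUT_TOKENS


direct = response_cost(answer_tokens=200)
with_cot = response_cost(answer_tokens=200, reasoning_tokens=1800)
print(f"direct: ${direct:.4f}, with CoT: ${with_cot:.4f} "
      f"({with_cot / direct:.0f}x)")
```

With these assumed numbers, a response that emits nine reasoning tokens per answer token costs ten times as much, which is the tradeoff the chain-of-thought design accepts in exchange for better math and reasoning scores.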

Its strongest category is Reasoning (#37), while its weakest is Multimodal & Grounded (#58). This performance profile makes it particularly strong for complex reasoning, multi-step problem solving, and analytical tasks.

Creator: Inception
Source Type: Proprietary
Reasoning: Yes (explicit chain-of-thought)
Context Window: 128K tokens
Overall Score: 65 (#43 of 123)
Arena Elo: 1268

Knowledge Benchmarks

MMLU: 78
GPQA: 78
SuperGPQA: 76
OpenBookQA: 74
MMLU-Pro: 72
HLE: 9
FrontierScience: 69

Coding Benchmarks

HumanEval: 75
SWE-bench Verified: 46
LiveCodeBench: 38
SWE-bench Pro: 43

Mathematics Benchmarks

AIME 2023: 81
AIME 2024: 83
AIME 2025: 82
HMMT Feb 2023: 77
HMMT Feb 2024: 79
HMMT Feb 2025: 78
BRUMO 2025: 80
MATH-500: 82

Reasoning Benchmarks

SimpleQA: 82
MuSR: 82
BBH: 87
LongBench v2: 77
MRCRv2: 76

Agentic Benchmarks

Terminal-Bench 2.0: 63
BrowseComp: 67
OSWorld-Verified: 62

Multimodal & Grounded Benchmarks

MMMU-Pro: 66
OfficeQA Pro: 71

Instruction Following Benchmarks

IFEval: 84

Multilingual Benchmarks

MGSM: 81
MMLU-ProX: 79
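The per-benchmark scores above can be aggregated into rough category averages with a simple unweighted mean. Note this is only a sketch: BenchLM's published category averages may use different weighting or include results not listed here, so a plain mean need not match their quoted figures exactly.

```python
# Sketch: unweighted means over a few of the per-benchmark scores
# listed above. BenchLM's actual aggregation method is not public
# and may weight benchmarks differently.

scores = {
    "Coding": [75, 46, 38, 43],
    "Mathematics": [81, 83, 82, 77, 79, 78, 80, 82],
    "Agentic": [63, 67, 62],
}

for category, values in scores.items():
    mean = sum(values) / len(values)
    print(f"{category}: {mean:.1f} (n={len(values)})")
```

Running this shows, for example, that the mathematics scores cluster tightly around 80 while the coding scores are far more spread out, which matches the category rankings reported in the FAQ.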

Frequently Asked Questions

How does Mercury 2 perform overall in AI benchmarks?

Mercury 2 ranks #43 out of 123 models with an overall score of 65. It is created by Inception and features a 128K context window.

Is Mercury 2 good for knowledge and understanding?

Mercury 2 ranks #53 out of 123 models in knowledge and understanding benchmarks with an average score of 57.2. There are stronger options in this category.

Is Mercury 2 good for coding and programming?

Mercury 2 ranks #57 out of 123 models in coding and programming benchmarks with an average score of 41.1. There are stronger options in this category.

Is Mercury 2 good for mathematics?

Mercury 2 ranks #46 out of 123 models in mathematics benchmarks with an average score of 80.9. There are stronger options in this category.

Is Mercury 2 good for reasoning and logic?

Mercury 2 ranks #37 out of 123 models in reasoning and logic benchmarks with an average score of 80.1. This is its strongest category, though stronger options still exist.

Is Mercury 2 good for agentic tool use and computer tasks?

Mercury 2 ranks #38 out of 123 models in agentic tool use and computer tasks benchmarks with an average score of 63.7. There are stronger options in this category.

Is Mercury 2 good for multimodal and grounded tasks?

Mercury 2 ranks #58 out of 123 models in multimodal and grounded tasks benchmarks with an average score of 68.3. This is its weakest category, and there are stronger options.

Is Mercury 2 good for instruction following?

Mercury 2 ranks #51 out of 123 models in instruction following benchmarks with an average score of 84. There are stronger options in this category.

Is Mercury 2 good for multilingual tasks?

Mercury 2 ranks #53 out of 123 models in multilingual tasks benchmarks with an average score of 79.7. There are stronger options in this category.

What is the context window size of Mercury 2?

Mercury 2 has a context window of 128K tokens, which determines how much text it can process in a single interaction.

Last updated: March 12, 2026
