Benchmark analysis of Mercury 2 by Inception across 32 sourced tests on BenchLM.
According to BenchLM.ai, Mercury 2 ranks #43 out of 123 models with an overall score of 65/100. While not a frontier model, it offers specific advantages depending on the use case.
Mercury 2 is a proprietary model with a 128K token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.
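The cost of chain-of-thought reasoning can be made concrete with a small sketch. The figures below are purely illustrative (not Mercury 2's actual pricing or throughput): the point is that cost and latency scale with the extra reasoning tokens emitted before the final answer.

```python
# Illustrative sketch: why chain-of-thought raises latency and token usage.
# All numbers are hypothetical, for comparison only.

def estimate_cost(output_tokens: int, price_per_1k: float, tokens_per_sec: float):
    """Return (dollar cost, latency in seconds) for a completion."""
    return output_tokens * price_per_1k / 1000, output_tokens / tokens_per_sec

# A direct answer might emit ~50 tokens; a chain-of-thought answer emits
# the same 50 plus a few hundred intermediate reasoning tokens.
direct_cost, direct_latency = estimate_cost(50, price_per_1k=0.01, tokens_per_sec=100)
cot_cost, cot_latency = estimate_cost(50 + 400, price_per_1k=0.01, tokens_per_sec=100)

print(f"direct: ${direct_cost:.4f}, {direct_latency:.1f}s")            # → direct: $0.0005, 0.5s
print(f"chain-of-thought: ${cot_cost:.4f}, {cot_latency:.1f}s")        # → chain-of-thought: $0.0045, 4.5s
```

Under these assumed numbers, the reasoning traces make each request roughly nine times more expensive and slower, which is the trade-off the benchmark scores reflect.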
Its strongest category is Reasoning (#37), while its weakest is Multimodal & Grounded (#58). This performance profile makes it particularly strong for complex reasoning, multi-step problem solving, and analytical tasks.
Creator: Inception
Type: Proprietary
Reasoning: Explicit chain-of-thought
Context Window: 128K
Overall Score: 65/100
Arena Elo: 1268
Mercury 2 ranks #43 out of 123 models with an overall score of 65. Created by Inception, it features a 128K context window.
Category rankings (out of 123 models, with average benchmark scores):
- Knowledge & Understanding: #53, average score 57.2
- Coding & Programming: #57, average score 41.1
- Mathematics: #46, average score 80.9
- Reasoning & Logic: #37, average score 80.1
- Agentic Tool Use & Computer Tasks: #38, average score 63.7
- Multimodal & Grounded Tasks: #58, average score 68.3
- Instruction Following: #51, average score 84.0
- Multilingual Tasks: #53, average score 79.7

In every category, stronger options exist among higher-ranked models.
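The category figures above can be loaded into a small script to surface the strongest and weakest areas programmatically (rank is out of 123, so lower is better):

```python
# Mercury 2's per-category (rank, average score) pairs from BenchLM.
categories = {
    "Knowledge & Understanding": (53, 57.2),
    "Coding & Programming":      (57, 41.1),
    "Mathematics":               (46, 80.9),
    "Reasoning & Logic":         (37, 80.1),
    "Agentic Tool Use":          (38, 63.7),
    "Multimodal & Grounded":     (58, 68.3),
    "Instruction Following":     (51, 84.0),
    "Multilingual":              (53, 79.7),
}

# Sort by rank (lower rank number = stronger relative standing).
by_rank = sorted(categories.items(), key=lambda kv: kv[1][0])
best, worst = by_rank[0], by_rank[-1]

print(f"strongest: {best[0]} (#{best[1][0]}, score {best[1][1]})")
print(f"weakest:   {worst[0]} (#{worst[1][0]}, score {worst[1][1]})")
```

Sorting by rank rather than raw score matters here: Instruction Following has the highest raw score (84.0) but only ranks #51, while Reasoning & Logic ranks #37, which is why it counts as the strongest category.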
Mercury 2 has a context window of 128K tokens, which caps how much text (prompt plus response) it can handle in a single interaction.
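A rough sketch of what that limit means in practice: the 4-characters-per-token ratio below is a common rule of thumb for English text, not Mercury 2's actual tokenizer, so treat the estimate as approximate.

```python
# Rough check of whether a document fits in a 128K-token context window.
CONTEXT_WINDOW = 128_000
CHARS_PER_TOKEN = 4  # heuristic for English text; real tokenizers vary

def fits_in_context(text: str, reserved_for_output: int = 4_000) -> bool:
    """Estimate token count and leave headroom for the model's response."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= CONTEXT_WINDOW - reserved_for_output

print(fits_in_context("word " * 10_000))  # ~12,500 estimated tokens → True
```

Reserving some of the window for the model's output is important for a chain-of-thought model, since its reasoning tokens also consume context.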