A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.
As of June 9, 2026, Qwen3.7 Plus leads the MRCRv2 leaderboard with 91.7%, followed by Qwen3.7 Max (90.4%) and Gemini 3.5 Flash (77.3%).
Qwen3.7 Plus (Alibaba)
Qwen3.7 Max (Alibaba)
Gemini 3.5 Flash
According to BenchLM.ai, Qwen3.7 Plus leads the MRCRv2 benchmark with a score of 91.7%, followed by Qwen3.7 Max (90.4%) and Gemini 3.5 Flash (77.3%). The 14.4-point gap between first and third place shows a significant spread across the leaderboard, which makes this benchmark effective at differentiating model capabilities.
Four models have been evaluated on MRCRv2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, MRCRv2 contributes 25% of the category score, so strong performance here directly affects a model's overall ranking.
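To make the weighting concrete, here is a minimal sketch of how the two percentages above might combine. It assumes the category weight and the within-category weight simply multiply, which the page does not confirm; the function names `effective_weight` and `overall_points` are hypothetical and not part of any BenchLM.ai API. The scores are the three published above.

```python
# Sketch only: assumes BenchLM.ai combines weights multiplicatively (not confirmed by the source).
CATEGORY_WEIGHT = 0.17   # Reasoning category share of the overall score (from the text)
BENCHMARK_WEIGHT = 0.25  # MRCRv2 share of the Reasoning category score (from the text)


def effective_weight(category_weight: float, benchmark_weight: float) -> float:
    """Share of the overall score attributable to one benchmark (hypothetical helper)."""
    return category_weight * benchmark_weight


def overall_points(score_pct: float) -> float:
    """Overall-score points contributed by an MRCRv2 score, under the same assumption."""
    return score_pct * effective_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT)


# Scores as listed on the leaderboard above.
scores = {
    "Qwen3.7 Plus": 91.7,
    "Qwen3.7 Max": 90.4,
    "Gemini 3.5 Flash": 77.3,
}

if __name__ == "__main__":
    weight = effective_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT)
    print(f"Effective MRCRv2 weight: {weight:.2%}")
    for model, score in scores.items():
        print(f"{model}: {overall_points(score):.3f} overall points")
```

Under this assumption, MRCRv2 would account for 4.25% of the overall score, and the 14.4-point gap between first and third place would translate to roughly 0.61 overall points.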
Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context
MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
Version: MRCRv2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Qwen3.7 Plus by Alibaba currently leads with a score of 91.7% on MRCRv2.
Four AI models have been evaluated on MRCRv2 on BenchLM.