A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.
According to BenchLM.ai, GPT-5.4 Pro leads the MRCRv2 benchmark with a score of 97, tied with GPT-5.4 (97) and just ahead of Gemini 3 Pro Deep Think (96). The top models are clustered within one point, suggesting this benchmark is nearing saturation for frontier models.
121 models have been evaluated on MRCRv2. The benchmark falls in the reasoning category, which carries a 14% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
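To illustrate how a single category weight feeds into an overall ranking, here is a minimal sketch of a weighted-average score. Only the 14% reasoning weight comes from the text above; the other category names, weights, and scores are hypothetical placeholders, not BenchLM.ai's actual formula.

```python
# Hypothetical weighted-average overall score.
# Only the 0.14 reasoning weight is stated in the text; everything
# else here is invented for illustration.
category_weights = {"reasoning": 0.14, "coding": 0.20, "other": 0.66}
category_scores = {"reasoning": 97.0, "coding": 90.0, "other": 85.0}

overall = sum(category_weights[c] * category_scores[c] for c in category_weights)
print(round(overall, 2))  # weighted average across categories
```

Under this sketch, a one-point gain on a reasoning benchmark moves the overall score by at most 0.14 points, which is why strong performance in a heavily weighted category matters for ranking.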
Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context
MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
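The idea of checking retrieval across long, multi-round interactions can be sketched as a "needle in a conversation" test: bury one key fact in the middle of many filler rounds, then ask for it at the end. This is a simplified illustration of the setup, not the actual MRCRv2 harness; the function names, message counts, and grading rule are assumptions.

```python
def build_multiround_prompt(needle: str, n_rounds: int = 50, needle_round: int = 20):
    """Build a long multi-round conversation with one key fact buried inside.

    Simplified illustration of a multi-round retrieval test; the real
    benchmark's construction and grading differ.
    """
    messages = []
    for i in range(n_rounds):
        if i == needle_round:
            messages.append({"role": "user", "content": f"Remember this: {needle}"})
        else:
            messages.append({"role": "user", "content": f"Filler message {i}."})
        messages.append({"role": "assistant", "content": "Noted."})
    messages.append({"role": "user", "content": "What did I ask you to remember?"})
    return messages

def grade(model_answer: str, needle: str) -> bool:
    # Exact-substring grading; real harnesses often use stricter matching.
    return needle in model_answer

conv = build_multiround_prompt("the passcode is 4421")
print(len(conv))  # 50 rounds x 2 messages + 1 final question = 101
```

A model that merely summarizes recent turns will fail this test; retrieving the buried fact requires genuine long-context recall.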