MRCRv2

A benchmark of memory, retrieval, and multi-round coherence over very long contexts.

According to BenchLM.ai, GPT-5.4 Pro leads the MRCRv2 benchmark with a score of 97, tied with GPT-5.4 (97) and just ahead of Gemini 3 Pro Deep Think (96). The top models are clustered within 1 point, suggesting the benchmark is nearing saturation for frontier models.

121 models have been evaluated on MRCRv2. The benchmark falls in the reasoning category, which carries a 14% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
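As a rough illustration of how a category weight feeds into an overall score, the sketch below assumes a simple weighted average of per-category benchmark means. This is an assumption: the category names, the weights other than reasoning's 14%, and all scores are invented for illustration, since BenchLM.ai's actual aggregation formula is not documented here.

```python
# Hypothetical sketch of a weighted overall score. BenchLM.ai's actual
# aggregation is not documented; categories, scores, and all weights
# except reasoning's 14% are illustrative assumptions.
category_weights = {"reasoning": 0.14, "coding": 0.20, "other": 0.66}
category_scores = {
    "reasoning": [97, 93],  # e.g. MRCRv2 plus another reasoning benchmark
    "coding": [90],
    "other": [85],
}

# Average the benchmarks within each category, then combine by weight.
overall = sum(
    weight * sum(category_scores[cat]) / len(category_scores[cat])
    for cat, weight in category_weights.items()
)
print(f"overall score ≈ {overall:.1f}")  # 0.14*95 + 0.20*90 + 0.66*85 = 87.4
```

Under this assumed scheme, a one-point change on MRCRv2 moves the overall score by only a fraction of a point, but across a tightly clustered leaderboard that fraction can still reorder rankings.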

About MRCRv2

Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context

MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
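To make the task concrete, here is a minimal Python sketch of an MRCR-style probe: several near-identical "needle" turns are buried in a long synthetic conversation, and the model is asked to reproduce a specific one verbatim. Everything in it (build_mrcr_probe, the topics, the payload format) is invented for illustration; it shows the general MRCR setup, not MRCRv2's actual construction or scoring.

```python
import random

TOPICS = ["tapirs", "glaciers", "sourdough", "violins"]

def build_mrcr_probe(n_turns=200, needle_topic="tapirs", n_needles=4,
                     target_index=2, seed=0):
    """Build a long multi-turn chat containing n_needles near-identical
    'needle' turns, then ask for the target_index-th needle verbatim.
    Returns (messages, expected_answer). Hypothetical construction."""
    rng = random.Random(seed)
    needle_positions = set(rng.sample(range(n_turns), n_needles))
    messages, needles = [], []
    for i in range(n_turns):
        if i in needle_positions:
            # Each needle carries a unique payload so retrieval can be
            # checked verbatim against the expected answer.
            payload = (f"Poem #{len(needles) + 1} about {needle_topic}: "
                       f"roses are {rng.random():.3f}")
            needles.append(payload)
            messages.append({"role": "assistant", "content": payload})
        else:
            # Filler turns on other topics dilute the context.
            topic = rng.choice([t for t in TOPICS if t != needle_topic])
            messages.append({"role": "user" if i % 2 == 0 else "assistant",
                             "content": f"A short note about {topic}."})
    # The final question targets one specific needle instance.
    messages.append({"role": "user",
                     "content": (f"Reproduce poem #{target_index} about "
                                 f"{needle_topic} exactly as written.")})
    return messages, needles[target_index - 1]

messages, expected = build_mrcr_probe()
print(f"{len(messages)} messages; expected answer: {expected!r}")
```

The distractor needles are what separate this from a plain needle-in-a-haystack test: the model must track which of several nearly identical items came in which round, not merely find a unique string.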

Leaderboard (121 models)

#1 GPT-5.4 Pro: 97
#2 GPT-5.4: 97
#4 GPT-5.2 Pro: 95
#5 GPT-5.3 Instant: 94
#6 GPT-5.3 Codex: 93
#7 GPT-5.2: 93
#10 Claude Opus 4.6: 92
#11 GPT-5.2-Codex: 91
#12 Gemini 3.1 Pro: 90
#13 Grok 4.1: 89
#15 GLM-5 (Reasoning): 87
#16 Gemini 3 Pro: 87
#18 GPT-5.2 Instant: 84
#19 GPT-5.1: 84
#20 o1-preview: 83
#21 Gemini 2.5 Pro: 83
#23 GPT-4.1: 82
#24 GPT-4.1 mini: 82
#25 GPT-5 (medium): 81
#26 Claude Opus 4.5: 81
#27 Claude Sonnet 4.5: 81
#28 Kimi K2.5 (Reasoning): 81
#29 o3-pro: 81
#30 o3: 81
#31 Qwen2.5-1M: 81
#32 GPT-5 (high): 80
#33 o3-mini: 80
#34 Claude Sonnet 4.6: 79
#35 GPT-5 mini: 79
#37 GLM-4.7: 78
#38 Seed 1.6: 78
#39 o1: 77
#40 Seed-2.0-Lite: 77
#42 Mercury 2: 76
#43 Gemini 3 Flash: 76
#44 GLM-4.7-Flash: 76
#46 o4-mini (high): 74
#47 Seed 1.6 Flash: 74
#48 MiMo-V2-Flash: 73
#49 Step 3.5 Flash: 73
#50 GLM-5: 73
#52 Gemini 1.5 Pro: 73
#53 GPT-4.1 nano: 73
#54 DeepSeekMath V2: 72
#55 Claude 4 Sonnet: 72
#56 Seed-2.0-Mini: 72
#57 DeepSeek Coder 2.0: 71
#58 Grok 4: 71
#59 Claude 4.1 Opus: 71
#60 Qwen2.5-72B: 71
#61 Qwen3.5 397B: 71
#62 DeepSeek V3.2: 70
#63 Claude Haiku 4.5: 70
#64 Claude 3.5 Sonnet: 70
#65 Kimi K2.5: 70
#66 DeepSeek LLM 2.0: 69
#67 MiniMax M2.5: 69
#68 Mistral Large 2: 68
#69 Gemini 2.5 Flash: 68
#70 Mistral Large 3: 67
#73 Llama 4 Scout: 66
#75 Aion-2.0: 65
#76 GPT-4o: 63
#77 Claude 3 Opus: 63
#78 Claude 3 Haiku: 63
#80 GPT-4 Turbo: 62
#81 GPT-5 nano: 61
#82 Llama 3 70B: 61
#83 Ministral 3 14B: 60
#85 GPT-OSS 120B: 59
#86 o1-pro: 59
#88 Z-1: 57
#89 DeepSeek-R1: 57
#91 Moonshot v1: 56
#93 Gemini 1.0 Pro: 54
#94 Mistral 8x7B: 53
#96 Qwen3 235B 2507: 52
#97 GLM-4.5: 52
#98 Nemotron-4 15B: 51
#100 Nova Pro: 51
#101 GLM-4.5-Air: 51
#102 GPT-4o mini: 50
#103 Kimi K2: 50
#104 DeepSeek V3.1: 48
#105 GPT-OSS 20B: 48
#108 MiniMax M1 80k: 46
#109 LFM2-24B-A2B: 45
#110 Gemma 3 27B: 44
#111 Qwen2.5-VL-32B: 43
#112 LFM2.5-1.2B-Thinking: 42
#113 Ministral 3 8B: 41
#114 Mistral 7B v0.3: 41
#117 Mistral 8x7B v0.2: 38
#118 DBRX Instruct: 37
#119 LFM2.5-1.2B-Instruct: 37
#120 Ministral 3 3B: 35
#121 Phi-4: 33

FAQ

What does MRCRv2 measure?

MRCRv2 measures memory, retrieval, and multi-round coherence over very long contexts.

Which model scores highest on MRCRv2?

GPT-5.4 Pro by OpenAI currently leads with a score of 97 on MRCRv2.

How many models are evaluated on MRCRv2?

121 AI models have been evaluated on MRCRv2 on BenchLM.ai.

Last updated: March 12, 2026
