MRCRv2

A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.

How BenchLM shows MRCRv2 right now

BenchLM is tracking MRCRv2 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

125 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MRCRv2 — April 10, 2026

BenchLM mirrors the published tracked score view for MRCRv2. GPT-5.4 leads the public snapshot at 97%, followed by Gemini 3 Pro Deep Think (96%) and GPT-5.2 Pro (95%). BenchLM does not use these results to rank models overall.

125 models · Reasoning · 25% of category score · Current · Updated April 10, 2026

The published MRCRv2 snapshot is tightly clustered at the top: GPT-5.4 sits at 97%, while the third row is only 2.0 points behind. The broader top-10 spread is 7.0 points, so many of the published scores sit in a relatively narrow band.
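
For readers who want to check that arithmetic, here is a minimal sketch; the top-10 scores are copied from the tracked table below, and "points" are plain percentage-point differences:

```python
# Top-10 tracked MRCRv2 scores, copied from the table below.
top10 = [97, 96, 95, 94, 93, 93, 93, 92, 91, 90]

leader = top10[0]
gap_to_third = leader - top10[2]   # 97 - 95 = 2 points
top10_spread = leader - top10[-1]  # 97 - 90 = 7 points

print(f"Gap from rank 1 to rank 3: {gap_to_third} points")
print(f"Top-10 spread: {top10_spread} points")
```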

125 models have been evaluated on MRCRv2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system, and within that category MRCRv2 is assigned 25% of the category score. Once exact-source verification is complete, strong performance here will therefore feed directly into a model's overall ranking; until then, as noted above, the snapshot is display-only.
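
To make the compounding of those two weights concrete, here is a minimal sketch; it assumes the overall score is a plain linear weighted average, which is our simplification rather than a statement of BenchLM's actual formula:

```python
CATEGORY_WEIGHT = 0.17    # Reasoning category's share of the overall score
BENCHMARK_WEIGHT = 0.25   # MRCRv2's share of the Reasoning category score

# Effective share of the overall score driven by MRCRv2 alone,
# assuming the weights simply multiply through a linear average.
effective_weight = CATEGORY_WEIGHT * BENCHMARK_WEIGHT
print(f"Effective weight: {effective_weight:.4f}")  # 0.0425, i.e. 4.25%

# Under that assumption, a 10-point swing on MRCRv2 moves the
# overall score by about 0.425 points.
print(f"10-point MRCRv2 swing -> {10 * effective_weight:.3f} overall points")
```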

About MRCRv2

Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context

MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
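
As a toy illustration of the task shape (not the actual MRCRv2 harness), the sketch below hides one target answer among many look-alike rounds of a long conversation and checks whether a model can reproduce it; `build_conversation`, the `ask_model` callable, and the exact-match grading are all illustrative assumptions:

```python
import random
from typing import Callable

def build_conversation(n_rounds: int, needle_round: int,
                       seed: int = 0) -> tuple[list[dict], str]:
    """Build a long multi-round chat where every round looks alike,
    then ask the model to reproduce the answer from one specific round."""
    rng = random.Random(seed)
    messages, target = [], ""
    for i in range(1, n_rounds + 1):
        answer = f"fact-{rng.randrange(10**6)}"
        messages.append({"role": "user", "content": f"Round {i}: tell me a fact."})
        messages.append({"role": "assistant", "content": answer})
        if i == needle_round:
            target = answer
    messages.append({"role": "user",
                     "content": f"Repeat exactly the fact you gave in round {needle_round}."})
    return messages, target

def score_retrieval(ask_model: Callable[[list[dict]], str],
                    n_trials: int = 20) -> float:
    """Fraction of trials where the model reproduces the right round's answer."""
    hits = 0
    for seed in range(n_trials):
        messages, target = build_conversation(n_rounds=50,
                                              needle_round=seed % 50 + 1,
                                              seed=seed)
        hits += ask_model(messages).strip() == target
    return hits / n_trials
```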

BenchLM freshness & provenance

Version: MRCRv2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
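
As a hypothetical sketch of how such a gate might look in code (the field names and rules here are invented for illustration; the authoritative policy lives on the methodology page):

```python
from dataclasses import dataclass

@dataclass
class Freshness:
    staleness_state: str        # e.g. "Current", "Aging", "Stale"
    questions_public: bool      # public benchmark set -> higher leak risk
    rows_fully_verified: bool   # exact-source attachments completed

def benchmark_role(meta: Freshness) -> str:
    """Map freshness metadata to a display role (illustrative thresholds)."""
    if not meta.rows_fully_verified:
        return "display-only reference"          # MRCRv2's current state
    if meta.staleness_state == "Current" and not meta.questions_public:
        return "strong differentiator"
    if meta.staleness_state == "Current":
        return "benchmark to watch"
    return "display-only reference"

# MRCRv2 as described on this page: current, public set, awaiting attachments.
print(benchmark_role(Freshness("Current", True, False)))  # display-only reference
```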

Tracked score table (125 models)

| Rank | Model | Slug | Score |
| --- | --- | --- | --- |
| 1 | GPT-5.4 | gpt-5-4 | 97% |
| 2 | Gemini 3 Pro Deep Think | gemini-3-pro-deep-think | 96% |
| 3 | GPT-5.2 Pro | gpt-5-2-pro | 95% |
| 4 | GPT-5.3 Instant | gpt-5-3-instant | 94% |
| 5 | GPT-5.1-Codex-Max | gpt-5-1-codex-max | 93% |
| 6 | GPT-5.3 Codex | gpt-5-3-codex | 93% |
| 7 | GPT-5.2 | gpt-5-2 | 93% |
| 8 | GPT-5.3-Codex-Spark | gpt-5-3-codex-spark | 92% |
| 9 | GPT-5.2-Codex | gpt-5-2-codex | 91% |
| 10 | Gemini 3.1 Pro | gemini-3-1-pro | 90% |
| 11 | Grok 4.1 | grok-4-1 | 89% |
| 12 | Grok 4.1 Fast | grok-4-1-fast | 89% |
| 13 | GLM-5 (Reasoning) | glm-5-reasoning | 87% |
| 14 | Gemini 3 Pro | gemini-3-pro | 87% |
| 15 | Nemotron 3 Ultra 500B | nemotron-3-ultra-500b | 85% |
| 16 | GPT-5.1 | gpt-5-1 | 84% |
| 17 | GPT-5.2 Instant | gpt-5-2-instant | 84% |
| 18 |  |  | 83% |
| 19 | Gemini 2.5 Pro | gemini-2-5-pro | 83% |
| 20 | Qwen3.5 397B (Reasoning) | qwen3-5-397b-reasoning | 82% |
| 21 | GPT-4.1 | gpt-4-1 | 82% |
| 22 | GPT-4.1 mini | gpt-4-1-mini | 82% |
| 23 | GPT-5 (medium) | gpt-5-medium | 81% |
| 24 | Kimi K2.5 (Reasoning) | kimi-k2-5-reasoning | 81% |
| 25 | Claude Opus 4.5 | claude-opus-4-5 | 81% |
| 26 | Claude Sonnet 4.5 | claude-sonnet-4-5 | 81% |
| 27 |  |  | 81% |
| 28 |  |  | 81% |
| 29 | Qwen2.5-1M | qwen2-5-1m | 81% |
| 30 | GPT-5 (high) | gpt-5-high | 80% |
| 31 |  |  | 80% |
| 32 | Claude Sonnet 4.6 | claude-sonnet-4-6 | 79% |
| 33 | GPT-5 mini | gpt-5-mini | 79% |
| 34 | GLM-4.7 | glm-4-7 | 78% |
| 35 | DeepSeek V3.2 (Thinking) | deepseek-v3-2-thinking | 78% |
| 36 | Seed 1.6 | seed-1-6 | 78% |
| 37 |  |  | 77% |
| 38 | Seed-2.0-Lite | seed-2-0-lite | 77% |
| 39 | Nemotron 3 Super 120B A12B | nemotron-3-super-120b-a12b | 77% |
| 40 | Claude Opus 4.6 | claude-opus-4-6 | 76% |
| 41 | Gemini 3 Flash | gemini-3-flash | 76% |
| 42 | Mercury 2 | mercury-2 | 76% |
| 43 | GLM-4.7-Flash | glm-4-7-flash | 76% |
| 44 | Nemotron 3 Super 100B | nemotron-3-super-100b | 75% |
| 45 | o4-mini (high) | o4-mini-high | 74% |
| 46 | Claude 4.1 Opus Thinking | claude-4-1-opus-thinking | 74% |
| 47 | Seed 1.6 Flash | seed-1-6-flash | 74% |
| 48 | MiniMax M1 80k | minimax-m1-80k | 73.4% |
| 49 | MiMo-V2-Flash | mimo-v2-flash | 73% |
| 50 | GLM-5 | glm-5 | 73% |
| 51 | Gemini 3.1 Flash-Lite | gemini-3-1-flash-lite | 73% |
| 52 | Step 3.5 Flash | step-3-5-flash | 73% |
| 53 | Gemini 1.5 Pro | gemini-1-5-pro | 73% |
| 54 | GPT-4.1 nano | gpt-4-1-nano | 73% |
| 55 | GLM-5.1 | glm-5-1 | 73% |
| 56 | DeepSeekMath V2 | deepseekmath-v2 | 72% |
| 57 | Claude 4 Sonnet | claude-4-sonnet | 72% |
| 58 | Seed-2.0-Mini | seed-2-0-mini | 72% |
| 59 | Grok 4 | grok-4 | 71% |
| 60 | Qwen3.5 397B | qwen3-5-397b | 71% |
| 61 | DeepSeek Coder 2.0 | deepseek-coder-2-0 | 71% |
| 62 | Claude 4.1 Opus | claude-4-1-opus | 71% |
| 63 | Qwen2.5-72B | qwen2-5-72b | 71% |
| 64 | Kimi K2.5 | kimi-k2-5 | 70% |
| 65 | Claude Haiku 4.5 | claude-haiku-4-5 | 70% |
| 66 | DeepSeek V3.2 | deepseek-v3-2 | 70% |
| 67 | Claude 3.5 Sonnet | claude-3-5-sonnet | 70% |
| 68 | DeepSeek LLM 2.0 | deepseek-llm-2-0 | 69% |
| 69 | MiniMax M2.5 | minimax-m2-5 | 69% |
| 70 | Mistral Large 2 | mistral-large-2 | 68% |
| 71 | Gemini 2.5 Flash | gemini-2-5-flash | 68% |
| 72 | Mistral Large 3 | mistral-large-3 | 67% |
| 73 | Gemma 4 31B | gemma-4-31b | 66.4% |
| 74 | Grok Code Fast 1 | grok-code-fast-1 | 66% |
| 75 | Ministral 3 14B (Reasoning) | ministral-3-14b-reasoning | 66% |
| 76 | Llama 4 Scout | llama-4-scout | 66% |
| 77 | Llama 3.1 405B | llama-3-1-405b | 65% |
| 78 | Aion-2.0 | aion-2-0 | 65% |
| 79 | GPT-4o | gpt-4o | 63% |
| 80 | Claude 3 Opus | claude-3-opus | 63% |
| 81 | Claude 3 Haiku | claude-3-haiku | 63% |
| 82 | Llama 4 Maverick | llama-4-maverick | 63% |
| 83 | GPT-4 Turbo | gpt-4-turbo | 62% |
| 84 | Llama 3 70B | llama-3-70b | 61% |
| 85 | GPT-5 nano | gpt-5-nano | 61% |
| 86 | Ministral 3 14B | ministral-3-14b | 60% |
| 87 | GPT-OSS 120B | gpt-oss-120b | 59% |
| 88 |  |  | 59% |
| 89 | Qwen3 235B 2507 (Reasoning) | qwen3-235b-2507-reasoning | 58% |
| 90 | Z-1 | z-1 | 57% |
| 91 | DeepSeek-R1 | deepseek-r1 | 57% |
| 92 | Nemotron Ultra 253B | nemotron-ultra-253b | 56% |
| 93 | Moonshot v1 | moonshot-v1 | 56% |
| 94 | DeepSeek V3.1 (Reasoning) | deepseek-v3-1-reasoning | 56% |
| 95 | Gemini 1.0 Pro | gemini-1-0-pro | 54% |
| 96 | Mistral 8x7B | mistral-8x7b | 53% |
| 97 | Qwen3 235B 2507 | qwen3-235b-2507 | 52% |
| 98 | Grok 3 [Beta] | grok-3-beta | 52% |
| 99 | GLM-4.5 | glm-4-5 | 52% |
| 100 | Nemotron-4 15B | nemotron-4-15b | 51% |
| 101 | Nemotron 3 Nano 30B | nemotron-3-nano-30b | 51% |
| 102 | Nova Pro | nova-pro | 51% |
| 103 | GLM-4.5-Air | glm-4-5-air | 51% |
| 104 | GPT-4o mini | gpt-4o-mini | 50% |
| 105 | DeepSeek V3.1 | deepseek-v3-1 | 48% |
| 106 | GPT-OSS 20B | gpt-oss-20b | 48% |
| 107 | Ministral 3 8B (Reasoning) | ministral-3-8b-reasoning | 47% |
| 108 | Llama 4 Behemoth | llama-4-behemoth | 46% |
| 109 | LFM2-24B-A2B | lfm2-24b-a2b | 45% |
| 110 | Gemma 4 26B A4B | gemma-4-26b-a4b | 44.1% |
| 111 | Gemma 3 27B | gemma-3-27b | 44% |
| 112 | LFM2.5-1.2B-Thinking | lfm2-5-1-2b-thinking | 42% |
| 113 | Mistral 7B v0.3 | mistral-7b-v0-3 | 41% |
| 114 | Ministral 3 8B | ministral-3-8b | 41% |
| 115 | GPT-5.4 mini | gpt-5-4-mini | 40.7% |
| 116 | Ministral 3 3B (Reasoning) | ministral-3-3b-reasoning | 40% |
| 117 | GPT-5.4 nano | gpt-5-4-nano | 38.7% |
| 118 | Mixtral 8x22B Instruct v0.1 | mixtral-8x22b-instruct-v0-1 | 38% |
| 119 | Mistral 8x7B v0.2 | mistral-8x7b-v0-2 | 38% |
| 120 | DBRX Instruct | dbrx-instruct | 37% |
| 121 | LFM2.5-1.2B-Instruct | lfm2-5-1-2b-instruct | 37% |
| 122 | Ministral 3 3B | ministral-3-3b | 35% |
| 123 | Phi-4 | phi-4 | 33% |
| 124 | Gemma 4 E4B | gemma-4-e4b | 25.4% |
| 125 | Gemma 4 E2B | gemma-4-e2b | 19.1% |

FAQ

What does MRCRv2 measure?

MRCRv2 is a long-context benchmark that measures memory, retrieval, and multi-round coherence over large contexts.

Which model leads the published MRCRv2 snapshot?

GPT-5.4 currently leads the published MRCRv2 snapshot with a tracked score of 97%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MRCRv2?

125 AI models are included in BenchLM's mirrored MRCRv2 snapshot, based on the public leaderboard captured on April 10, 2026.

Last updated: April 10, 2026 · mirrored from the public benchmark leaderboard
