MRCRv2

A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.

How BenchLM shows MRCRv2 right now

BenchLM is tracking MRCRv2 in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

125 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on MRCRv2 — April 10, 2026

BenchLM mirrors the published tracked score view for MRCRv2. GPT-5.4 leads the public snapshot at 97%, followed by Gemini 3 Pro Deep Think (96%) and GPT-5.2 Pro (95%). BenchLM does not use these results to rank models overall.

125 models · Reasoning · 25% of category score · Current · Updated April 10, 2026

The published MRCRv2 snapshot is tightly clustered at the top: GPT-5.4 sits at 97%, while the third row is only 2.0 points behind. The broader top-10 spread is 7.0 points, so many of the published scores sit in a relatively narrow band.
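
For readers who want to check that arithmetic, here is a minimal sketch; the top-10 scores are copied from the tracked table below, and "points" are plain percentage-point differences:

```python
# Top-10 tracked MRCRv2 scores, copied from the table below.
top10 = [97, 96, 95, 94, 93, 93, 93, 92, 91, 90]

leader = top10[0]
gap_to_third = leader - top10[2]   # 97 - 95 = 2 points
top10_spread = leader - top10[-1]  # 97 - 90 = 7 points

print(f"Gap from rank 1 to rank 3: {gap_to_third} points")
print(f"Top-10 spread: {top10_spread} points")
```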

125 models have been evaluated on MRCRv2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system, and within that category MRCRv2 is assigned 25% of the category score. Once exact-source verification is complete, strong performance here will therefore feed directly into a model's overall ranking; until then, as noted above, the snapshot is display-only.
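
To make the compounding of those two weights concrete, here is a minimal sketch; it assumes the overall score is a plain linear weighted average, which is our simplification rather than a statement of BenchLM's actual formula:

```python
CATEGORY_WEIGHT = 0.17    # Reasoning category's share of the overall score
BENCHMARK_WEIGHT = 0.25   # MRCRv2's share of the Reasoning category score

# Effective share of the overall score driven by MRCRv2 alone,
# assuming the weights simply multiply through a linear average.
effective_weight = CATEGORY_WEIGHT * BENCHMARK_WEIGHT
print(f"Effective weight: {effective_weight:.4f}")  # 0.0425, i.e. 4.25%

# Under that assumption, a 10-point swing on MRCRv2 moves the
# overall score by about 0.425 points.
print(f"10-point MRCRv2 swing -> {10 * effective_weight:.3f} overall points")
```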

About MRCRv2

Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context

MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
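
As a toy illustration of the task shape (not the actual MRCRv2 harness), the sketch below hides one target answer among many look-alike rounds of a long conversation and checks whether a model can reproduce it; `build_conversation`, the `ask_model` callable, and the exact-match grading are all illustrative assumptions:

```python
import random
from typing import Callable

def build_conversation(n_rounds: int, needle_round: int,
                       seed: int = 0) -> tuple[list[dict], str]:
    """Build a long multi-round chat where every round looks alike,
    then ask the model to reproduce the answer from one specific round."""
    rng = random.Random(seed)
    messages, target = [], ""
    for i in range(1, n_rounds + 1):
        answer = f"fact-{rng.randrange(10**6)}"
        messages.append({"role": "user", "content": f"Round {i}: tell me a fact."})
        messages.append({"role": "assistant", "content": answer})
        if i == needle_round:
            target = answer
    messages.append({"role": "user",
                     "content": f"Repeat exactly the fact you gave in round {needle_round}."})
    return messages, target

def score_retrieval(ask_model: Callable[[list[dict]], str],
                    n_trials: int = 20) -> float:
    """Fraction of trials where the model reproduces the right round's answer."""
    hits = 0
    for seed in range(n_trials):
        messages, target = build_conversation(n_rounds=50,
                                              needle_round=seed % 50 + 1,
                                              seed=seed)
        hits += ask_model(messages).strip() == target
    return hits / n_trials
```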

BenchLM freshness & provenance

Version: MRCRv2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
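
As a hypothetical sketch of how such a gate might look in code (the field names and rules here are invented for illustration; the authoritative policy lives on the methodology page):

```python
from dataclasses import dataclass

@dataclass
class Freshness:
    staleness_state: str        # e.g. "Current", "Aging", "Stale"
    questions_public: bool      # public benchmark set -> higher leak risk
    rows_fully_verified: bool   # exact-source attachments completed

def benchmark_role(meta: Freshness) -> str:
    """Map freshness metadata to a display role (illustrative thresholds)."""
    if not meta.rows_fully_verified:
        return "display-only reference"          # MRCRv2's current state
    if meta.staleness_state == "Current" and not meta.questions_public:
        return "strong differentiator"
    if meta.staleness_state == "Current":
        return "benchmark to watch"
    return "display-only reference"

# MRCRv2 as described on this page: current, public set, awaiting attachments.
print(benchmark_role(Freshness("Current", True, False)))  # display-only reference
```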

Tracked score table (125 models)

| Rank | Model | Slug | Score |
| --- | --- | --- | --- |
| 1 | GPT-5.4 | gpt-5-4 | 97% |
| 2 | Gemini 3 Pro Deep Think | gemini-3-pro-deep-think | 96% |
| 3 | GPT-5.2 Pro | gpt-5-2-pro | 95% |
| 4 | GPT-5.3 Instant | gpt-5-3-instant | 94% |
| 5 | GPT-5.1-Codex-Max | gpt-5-1-codex-max | 93% |
| 6 | GPT-5.3 Codex | gpt-5-3-codex | 93% |
| 7 | GPT-5.2 | gpt-5-2 | 93% |
| 8 | GPT-5.3-Codex-Spark | gpt-5-3-codex-spark | 92% |
| 9 | GPT-5.2-Codex | gpt-5-2-codex | 91% |
| 10 | Gemini 3.1 Pro | gemini-3-1-pro | 90% |
| 11 | Grok 4.1 | grok-4-1 | 89% |
| 12 | Grok 4.1 Fast | grok-4-1-fast | 89% |
| 13 | GLM-5 (Reasoning) | glm-5-reasoning | 87% |
| 14 | Gemini 3 Pro | gemini-3-pro | 87% |
| 15 | Nemotron 3 Ultra 500B | nemotron-3-ultra-500b | 85% |
| 16 | GPT-5.1 | gpt-5-1 | 84% |
| 17 | GPT-5.2 Instant | gpt-5-2-instant | 84% |
| 18 |  |  | 83% |
| 19 | Gemini 2.5 Pro | gemini-2-5-pro | 83% |
| 20 | Qwen3.5 397B (Reasoning) | qwen3-5-397b-reasoning | 82% |
| 21 | GPT-4.1 | gpt-4-1 | 82% |
| 22 | GPT-4.1 mini | gpt-4-1-mini | 82% |
| 23 | GPT-5 (medium) | gpt-5-medium | 81% |
| 24 | Kimi K2.5 (Reasoning) | kimi-k2-5-reasoning | 81% |
| 25 | Claude Opus 4.5 | claude-opus-4-5 | 81% |
| 26 | Claude Sonnet 4.5 | claude-sonnet-4-5 | 81% |
| 27 |  |  | 81% |
| 28 |  |  | 81% |
| 29 | Qwen2.5-1M | qwen2-5-1m | 81% |
| 30 | GPT-5 (high) | gpt-5-high | 80% |
| 31 |  |  | 80% |
| 32 | Claude Sonnet 4.6 | claude-sonnet-4-6 | 79% |
| 33 | GPT-5 mini | gpt-5-mini | 79% |
| 34 | GLM-4.7 | glm-4-7 | 78% |
| 35 | DeepSeek V3.2 (Thinking) | deepseek-v3-2-thinking | 78% |
| 36 | Seed 1.6 | seed-1-6 | 78% |
| 37 |  |  | 77% |
| 38 | Seed-2.0-Lite | seed-2-0-lite | 77% |
| 39 | Nemotron 3 Super 120B A12B | nemotron-3-super-120b-a12b | 77% |
| 40 | Claude Opus 4.6 | claude-opus-4-6 | 76% |
| 41 | Gemini 3 Flash | gemini-3-flash | 76% |
| 42 | Mercury 2 | mercury-2 | 76% |
| 43 | GLM-4.7-Flash | glm-4-7-flash | 76% |
| 44 | Nemotron 3 Super 100B | nemotron-3-super-100b | 75% |
| 45 | o4-mini (high) | o4-mini-high | 74% |
| 46 | Claude 4.1 Opus Thinking | claude-4-1-opus-thinking | 74% |
| 47 | Seed 1.6 Flash | seed-1-6-flash | 74% |
| 48 | MiniMax M1 80k | minimax-m1-80k | 73.4% |
| 49 | MiMo-V2-Flash | mimo-v2-flash | 73% |
| 50 | GLM-5 | glm-5 | 73% |
| 51 | Gemini 3.1 Flash-Lite | gemini-3-1-flash-lite | 73% |
| 52 | Step 3.5 Flash | step-3-5-flash | 73% |
| 53 | Gemini 1.5 Pro | gemini-1-5-pro | 73% |
| 54 | GPT-4.1 nano | gpt-4-1-nano | 73% |
| 55 | GLM-5.1 | glm-5-1 | 73% |
| 56 | DeepSeekMath V2 | deepseekmath-v2 | 72% |
| 57 | Claude 4 Sonnet | claude-4-sonnet | 72% |
| 58 | Seed-2.0-Mini | seed-2-0-mini | 72% |
| 59 | Grok 4 | grok-4 | 71% |
| 60 | Qwen3.5 397B | qwen3-5-397b | 71% |
| 61 | DeepSeek Coder 2.0 | deepseek-coder-2-0 | 71% |
| 62 | Claude 4.1 Opus | claude-4-1-opus | 71% |
| 63 | Qwen2.5-72B | qwen2-5-72b | 71% |
| 64 | Kimi K2.5 | kimi-k2-5 | 70% |
| 65 | Claude Haiku 4.5 | claude-haiku-4-5 | 70% |
| 66 | DeepSeek V3.2 | deepseek-v3-2 | 70% |
| 67 | Claude 3.5 Sonnet | claude-3-5-sonnet | 70% |
| 68 | DeepSeek LLM 2.0 | deepseek-llm-2-0 | 69% |
| 69 | MiniMax M2.5 | minimax-m2-5 | 69% |
| 70 | Mistral Large 2 | mistral-large-2 | 68% |
| 71 | Gemini 2.5 Flash | gemini-2-5-flash | 68% |
| 72 | Mistral Large 3 | mistral-large-3 | 67% |
| 73 | Gemma 4 31B | gemma-4-31b | 66.4% |
| 74 | Grok Code Fast 1 | grok-code-fast-1 | 66% |
| 75 | Ministral 3 14B (Reasoning) | ministral-3-14b-reasoning | 66% |
| 76 | Llama 4 Scout | llama-4-scout | 66% |
| 77 | Llama 3.1 405B | llama-3-1-405b | 65% |
| 78 | Aion-2.0 | aion-2-0 | 65% |
| 79 | GPT-4o | gpt-4o | 63% |
| 80 | Claude 3 Opus | claude-3-opus | 63% |
| 81 | Claude 3 Haiku | claude-3-haiku | 63% |
| 82 | Llama 4 Maverick | llama-4-maverick | 63% |
| 83 | GPT-4 Turbo | gpt-4-turbo | 62% |
| 84 | Llama 3 70B | llama-3-70b | 61% |
| 85 | GPT-5 nano | gpt-5-nano | 61% |
| 86 | Ministral 3 14B | ministral-3-14b | 60% |
| 87 | GPT-OSS 120B | gpt-oss-120b | 59% |
| 88 |  |  | 59% |
| 89 | Qwen3 235B 2507 (Reasoning) | qwen3-235b-2507-reasoning | 58% |
| 90 | Z-1 | z-1 | 57% |
| 91 | DeepSeek-R1 | deepseek-r1 | 57% |
| 92 | Nemotron Ultra 253B | nemotron-ultra-253b | 56% |
| 93 | Moonshot v1 | moonshot-v1 | 56% |
| 94 | DeepSeek V3.1 (Reasoning) | deepseek-v3-1-reasoning | 56% |
| 95 | Gemini 1.0 Pro | gemini-1-0-pro | 54% |
| 96 | Mistral 8x7B | mistral-8x7b | 53% |
| 97 | Qwen3 235B 2507 | qwen3-235b-2507 | 52% |
| 98 | Grok 3 [Beta] | grok-3-beta | 52% |
| 99 | GLM-4.5 | glm-4-5 | 52% |
| 100 | Nemotron-4 15B | nemotron-4-15b | 51% |
| 101 | Nemotron 3 Nano 30B | nemotron-3-nano-30b | 51% |
| 102 | Nova Pro | nova-pro | 51% |
| 103 | GLM-4.5-Air | glm-4-5-air | 51% |
| 104 | GPT-4o mini | gpt-4o-mini | 50% |
| 105 | DeepSeek V3.1 | deepseek-v3-1 | 48% |
| 106 | GPT-OSS 20B | gpt-oss-20b | 48% |
| 107 | Ministral 3 8B (Reasoning) | ministral-3-8b-reasoning | 47% |
| 108 | Llama 4 Behemoth | llama-4-behemoth | 46% |
| 109 | LFM2-24B-A2B | lfm2-24b-a2b | 45% |
| 110 | Gemma 4 26B A4B | gemma-4-26b-a4b | 44.1% |
| 111 | Gemma 3 27B | gemma-3-27b | 44% |
| 112 | LFM2.5-1.2B-Thinking | lfm2-5-1-2b-thinking | 42% |
| 113 | Mistral 7B v0.3 | mistral-7b-v0-3 | 41% |
| 114 | Ministral 3 8B | ministral-3-8b | 41% |
| 115 | GPT-5.4 mini | gpt-5-4-mini | 40.7% |
| 116 | Ministral 3 3B (Reasoning) | ministral-3-3b-reasoning | 40% |
| 117 | GPT-5.4 nano | gpt-5-4-nano | 38.7% |
| 118 | Mixtral 8x22B Instruct v0.1 | mixtral-8x22b-instruct-v0-1 | 38% |
| 119 | Mistral 8x7B v0.2 | mistral-8x7b-v0-2 | 38% |
| 120 | DBRX Instruct | dbrx-instruct | 37% |
| 121 | LFM2.5-1.2B-Instruct | lfm2-5-1-2b-instruct | 37% |
| 122 | Ministral 3 3B | ministral-3-3b | 35% |
| 123 | Phi-4 | phi-4 | 33% |
| 124 | Gemma 4 E4B | gemma-4-e4b | 25.4% |
| 125 | Gemma 4 E2B | gemma-4-e2b | 19.1% |

FAQ

What does MRCRv2 measure?

MRCRv2 is a long-context benchmark that measures memory, retrieval, and multi-round coherence over large contexts.

Which model leads the published MRCRv2 snapshot?

GPT-5.4 currently leads the published MRCRv2 snapshot with a tracked score of 97%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MRCRv2?

125 AI models are included in BenchLM's mirrored MRCRv2 snapshot, based on the public leaderboard captured on April 10, 2026.

Last updated: April 10, 2026 · mirrored from the public benchmark leaderboard
