LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

According to BenchLM.ai, GPT-5.4 Pro leads LongBench v2 with a score of 95, followed by GPT-5.4 (95) and Gemini 3 Pro Deep Think (94). The top models are clustered within a single point, suggesting the benchmark is nearing saturation for frontier models.

121 models have been evaluated on LongBench v2. The benchmark falls in the reasoning category, which carries a 14% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
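This page states only the 14% category weight, not the full aggregation formula. As a minimal sketch of how a category weight translates into overall-ranking impact, assume the overall score is a weighted average of per-category scores (each 0-100); every category name and weight below other than reasoning's 14% is invented for illustration:

```python
# Hypothetical sketch of a weighted overall score. Only the 14% reasoning
# weight comes from the page; the other categories and weights are made up.

CATEGORY_WEIGHTS = {
    "reasoning": 0.14,   # stated weight for the category containing LongBench v2
    "coding": 0.20,      # illustrative
    "knowledge": 0.20,   # illustrative
    "math": 0.16,        # illustrative
    "agentic": 0.15,     # illustrative
    "safety": 0.15,      # illustrative
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    assert abs(sum(CATEGORY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(CATEGORY_WEIGHTS[c] * category_scores[c] for c in CATEGORY_WEIGHTS)

# A 5-point gain in the reasoning category moves the overall score by
# 0.14 * 5 = 0.7 points under these assumed weights.
example = {c: 80.0 for c in CATEGORY_WEIGHTS}
example["reasoning"] = 85.0
print(round(overall_score(example), 2))  # 80.7
```

Under this assumption, each point of category improvement moves the overall score by 0.14 points, which is why a strong LongBench v2 result measurably lifts a model's overall ranking.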

About LongBench v2

Year: 2025
Tasks: Long-context tasks
Format: Extended-context retrieval and reasoning
Difficulty: Hard long-context

LongBench v2 is useful because a large context window alone is not a capability: the benchmark measures whether a model can actually retain, retrieve, and reason over long inputs.
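LongBench v2 itself uses questions over long documents; as a rough illustration of why window size alone is not enough, here is a much simpler needle-in-a-haystack style probe (a diagnostic in the same spirit, not LongBench v2's actual methodology; model_answer is a hypothetical stand-in for any chat-completion call):

```python
def make_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Bury one fact (`needle`) at a relative depth inside filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def probe(model_answer, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Check whether retrieval survives as the fact moves deeper into context."""
    needle = "The access code for vault 7 is 4192."
    results = {}
    for depth in depths:
        context = make_haystack(needle, "The sky was a flat, even gray.", 2000, depth)
        reply = model_answer(context + "\n\nWhat is the access code for vault 7?")
        results[depth] = "4192" in reply
    return results
```

A model whose retrieval accuracy sags at middle depths has a smaller usable window than its advertised one (the "lost in the middle" effect), which is exactly the gap that long-context benchmarks like LongBench v2 are designed to expose.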

LongBench v2 Leaderboard (121 models)

#1 GPT-5.4 Pro: 95
#2 GPT-5.4: 95
#4 GPT-5.2 Pro: 93
#5 Gemini 3.1 Pro: 93
#6 GPT-5.3 Codex: 92
#7 GPT-5.3 Instant: 92
#8 Claude Opus 4.6: 92
#9 GPT-5.2: 91
#11 GPT-5.2-Codex: 90
#13 Grok 4.1: 90
#14 Gemini 3 Pro: 90
#15 GPT-5.2 Instant: 89
#16 o1-preview: 87
#18 GLM-5 (Reasoning): 86
#19 GPT-5.1: 84
#20 GPT-5 (high): 83
#21 Claude Sonnet 4.6: 83
#22 Claude Opus 4.5: 82
#23 Claude Sonnet 4.5: 82
#24 Kimi K2.5 (Reasoning): 82
#25 o3-mini: 82
#26 o3: 82
#27 Qwen2.5-1M: 82
#28 GPT-5 (medium): 81
#29 o3-pro: 81
#32 GPT-5 mini: 80
#33 Gemini 2.5 Pro: 80
#34 GPT-4.1: 80
#35 GPT-4.1 mini: 80
#36 o1: 79
#37 GLM-4.7: 79
#39 GLM-5: 77
#40 Mercury 2: 77
#41 Seed 1.6: 77
#43 Seed-2.0-Lite: 76
#44 DeepSeekMath V2: 75
#45 o4-mini (high): 75
#46 Gemini 3 Flash: 75
#48 GPT-4.1 nano: 75
#49 MiMo-V2-Flash: 74
#50 Step 3.5 Flash: 74
#51 DeepSeek Coder 2.0: 73
#52 Grok 4: 72
#53 Qwen2.5-72B: 72
#54 Claude Haiku 4.5: 72
#55 GLM-4.7-Flash: 72
#56 Qwen3.5 397B: 72
#57 Claude 4 Sonnet: 71
#58 Claude 4.1 Opus: 71
#59 DeepSeek LLM 2.0: 70
#60 Claude 3.5 Sonnet: 70
#61 Seed 1.6 Flash: 70
#62 Gemini 1.5 Pro: 70
#63 DeepSeek V3.2: 69
#66 Seed-2.0-Mini: 68
#67 Gemini 2.5 Flash: 68
#68 Mistral Large 3: 67
#69 Kimi K2.5: 67
#71 MiniMax M2.5: 66
#72 Mistral Large 2: 66
#73 Aion-2.0: 64
#75 Llama 4 Scout: 64
#76 Claude 3 Haiku: 63
#78 GPT-4o: 62
#79 Claude 3 Opus: 62
#80 GPT-4 Turbo: 62
#81 Llama 3 70B: 61
#82 Ministral 3 14B: 60
#84 GPT-OSS 120B: 58
#85 Moonshot v1: 58
#86 DeepSeek-R1: 58
#88 Mistral 8x7B: 57
#89 GPT-5 nano: 57
#91 Z-1: 56
#93 o1-pro: 54
#95 Nemotron-4 15B: 52
#96 Qwen3 235B 2507: 52
#97 Gemini 1.0 Pro: 51
#99 Nova Pro: 51
#100 GPT-4o mini: 49
#101 LFM2-24B-A2B: 48
#102 GLM-4.5: 48
#103 GPT-OSS 20B: 48
#104 Gemma 3 27B: 47
#105 GLM-4.5-Air: 47
#106 Kimi K2: 47
#108 DeepSeek V3.1: 46
#109 MiniMax M1 80k: 45
#111 Qwen2.5-VL-32B: 42
#113 LFM2.5-1.2B-Thinking: 39
#114 Mistral 8x7B v0.2: 39
#115 Ministral 3 8B: 38
#116 Mistral 7B v0.3: 38
#118 DBRX Instruct: 36
#119 LFM2.5-1.2B-Instruct: 34
#120 Ministral 3 3B: 32
#121 Phi-4: 30

FAQ

What does LongBench v2 measure?

LongBench v2 is a long-context benchmark that measures whether models can actually use their extended context windows for reasoning and retrieval, rather than merely accepting long inputs.

Which model scores highest on LongBench v2?

GPT-5.4 Pro by OpenAI currently leads with a score of 95 on LongBench v2.

How many models are evaluated on LongBench v2?

121 AI models have been evaluated on LongBench v2 on BenchLM.ai.

Last updated: March 12, 2026
