LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

According to BenchLM.ai, GPT-5.4 Pro leads LongBench v2 with a score of 95, followed by GPT-5.4 (95) and Gemini 3 Pro Deep Think (94). The top models are clustered within a single point, suggesting the benchmark is nearing saturation for frontier models.

121 models have been evaluated on LongBench v2. The benchmark falls in the reasoning category, which carries a 14% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
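This page states only the 14% category weight, not the full aggregation formula. As a minimal sketch of how a category weight translates into overall-ranking impact, assume the overall score is a weighted average of per-category scores (each 0-100); every category name and weight below other than reasoning's 14% is invented for illustration:

```python
# Hypothetical sketch of a weighted overall score. Only the 14% reasoning
# weight comes from the page; the other categories and weights are made up.

CATEGORY_WEIGHTS = {
    "reasoning": 0.14,   # stated weight for the category containing LongBench v2
    "coding": 0.20,      # illustrative
    "knowledge": 0.20,   # illustrative
    "math": 0.16,        # illustrative
    "agentic": 0.15,     # illustrative
    "safety": 0.15,      # illustrative
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    assert abs(sum(CATEGORY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(CATEGORY_WEIGHTS[c] * category_scores[c] for c in CATEGORY_WEIGHTS)

# A 5-point gain in the reasoning category moves the overall score by
# 0.14 * 5 = 0.7 points under these assumed weights.
example = {c: 80.0 for c in CATEGORY_WEIGHTS}
example["reasoning"] = 85.0
print(round(overall_score(example), 2))  # 80.7
```

Under this assumption, each point of category improvement moves the overall score by 0.14 points, which is why a strong LongBench v2 result measurably lifts a model's overall ranking.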

About LongBench v2

Year: 2025
Tasks: Long-context tasks
Format: Extended-context retrieval and reasoning
Difficulty: Hard long-context

LongBench v2 is useful because a large context window alone is not a capability: the benchmark measures whether a model can actually retain, retrieve, and reason over long inputs.
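LongBench v2 itself uses questions over long documents; as a rough illustration of why window size alone is not enough, here is a much simpler needle-in-a-haystack style probe (a diagnostic in the same spirit, not LongBench v2's actual methodology; model_answer is a hypothetical stand-in for any chat-completion call):

```python
def make_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Bury one fact (`needle`) at a relative depth inside filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def probe(model_answer, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Check whether retrieval survives as the fact moves deeper into context."""
    needle = "The access code for vault 7 is 4192."
    results = {}
    for depth in depths:
        context = make_haystack(needle, "The sky was a flat, even gray.", 2000, depth)
        reply = model_answer(context + "\n\nWhat is the access code for vault 7?")
        results[depth] = "4192" in reply
    return results
```

A model whose retrieval accuracy sags at middle depths has a smaller usable window than its advertised one (the "lost in the middle" effect), which is exactly the gap that long-context benchmarks like LongBench v2 are designed to expose.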

LongBench v2 Leaderboard (121 models)

#1 GPT-5.4 Pro: 95
#2 GPT-5.4: 95
#4 GPT-5.2 Pro: 93
#5 Gemini 3.1 Pro: 93
#6 GPT-5.3 Codex: 92
#7 GPT-5.3 Instant: 92
#8 Claude Opus 4.6: 92
#9 GPT-5.2: 91
#11 GPT-5.2-Codex: 90
#13 Grok 4.1: 90
#14 Gemini 3 Pro: 90
#15 GPT-5.2 Instant: 89
#16 o1-preview: 87
#18 GLM-5 (Reasoning): 86
#19 GPT-5.1: 84
#20 GPT-5 (high): 83
#21 Claude Sonnet 4.6: 83
#22 Claude Opus 4.5: 82
#23 Claude Sonnet 4.5: 82
#24 Kimi K2.5 (Reasoning): 82
#25 o3-mini: 82
#26 o3: 82
#27 Qwen2.5-1M: 82
#28 GPT-5 (medium): 81
#29 o3-pro: 81
#32 GPT-5 mini: 80
#33 Gemini 2.5 Pro: 80
#34 GPT-4.1: 80
#35 GPT-4.1 mini: 80
#36 o1: 79
#37 GLM-4.7: 79
#39 GLM-5: 77
#40 Mercury 2: 77
#41 Seed 1.6: 77
#43 Seed-2.0-Lite: 76
#44 DeepSeekMath V2: 75
#45 o4-mini (high): 75
#46 Gemini 3 Flash: 75
#48 GPT-4.1 nano: 75
#49 MiMo-V2-Flash: 74
#50 Step 3.5 Flash: 74
#51 DeepSeek Coder 2.0: 73
#52 Grok 4: 72
#53 Qwen2.5-72B: 72
#54 Claude Haiku 4.5: 72
#55 GLM-4.7-Flash: 72
#56 Qwen3.5 397B: 72
#57 Claude 4 Sonnet: 71
#58 Claude 4.1 Opus: 71
#59 DeepSeek LLM 2.0: 70
#60 Claude 3.5 Sonnet: 70
#61 Seed 1.6 Flash: 70
#62 Gemini 1.5 Pro: 70
#63 DeepSeek V3.2: 69
#66 Seed-2.0-Mini: 68
#67 Gemini 2.5 Flash: 68
#68 Mistral Large 3: 67
#69 Kimi K2.5: 67
#71 MiniMax M2.5: 66
#72 Mistral Large 2: 66
#73 Aion-2.0: 64
#75 Llama 4 Scout: 64
#76 Claude 3 Haiku: 63
#78 GPT-4o: 62
#79 Claude 3 Opus: 62
#80 GPT-4 Turbo: 62
#81 Llama 3 70B: 61
#82 Ministral 3 14B: 60
#84 GPT-OSS 120B: 58
#85 Moonshot v1: 58
#86 DeepSeek-R1: 58
#88 Mistral 8x7B: 57
#89 GPT-5 nano: 57
#91 Z-1: 56
#93 o1-pro: 54
#95 Nemotron-4 15B: 52
#96 Qwen3 235B 2507: 52
#97 Gemini 1.0 Pro: 51
#99 Nova Pro: 51
#100 GPT-4o mini: 49
#101 LFM2-24B-A2B: 48
#102 GLM-4.5: 48
#103 GPT-OSS 20B: 48
#104 Gemma 3 27B: 47
#105 GLM-4.5-Air: 47
#106 Kimi K2: 47
#108 DeepSeek V3.1: 46
#109 MiniMax M1 80k: 45
#111 Qwen2.5-VL-32B: 42
#113 LFM2.5-1.2B-Thinking: 39
#114 Mistral 8x7B v0.2: 39
#115 Ministral 3 8B: 38
#116 Mistral 7B v0.3: 38
#118 DBRX Instruct: 36
#119 LFM2.5-1.2B-Instruct: 34
#120 Ministral 3 3B: 32
#121 Phi-4: 30

FAQ

What does LongBench v2 measure?

LongBench v2 is a long-context benchmark that measures whether models can actually use their extended context windows for reasoning and retrieval, rather than merely accepting long inputs.

Which model scores highest on LongBench v2?

GPT-5.4 Pro by OpenAI currently leads with a score of 95 on LongBench v2.

How many models are evaluated on LongBench v2?

121 AI models have been evaluated on LongBench v2 on BenchLM.ai.

Last updated: March 12, 2026
