
LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

Top models on LongBench v2 — April 10, 2026

As of April 10, 2026, Claude Opus 4.5 leads the LongBench v2 leaderboard with 64.4%, followed by Qwen3.5 397B (63.2%) and Qwen3.6 Plus (62%).

8 models · Reasoning · 30% of category score · Current · Updated April 10, 2026

According to BenchLM.ai, Claude Opus 4.5 leads the LongBench v2 benchmark with a score of 64.4%, followed by Qwen3.5 397B (63.2%) and Qwen3.6 Plus (62%). The top models are clustered within 2.4 points, suggesting this benchmark is nearing saturation for frontier models.
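The 2.4-point spread follows directly from the reported scores. A minimal sketch of the saturation check (the 3-point threshold is an illustrative assumption, not BenchLM's published criterion):

```python
# Flag possible saturation when top models cluster tightly.
# Scores come from the leaderboard below; the 3-point threshold
# is an illustrative assumption, not BenchLM's published rule.
top_scores = [64.4, 63.2, 62.0]  # Claude Opus 4.5, Qwen3.5 397B, Qwen3.6 Plus

spread = max(top_scores) - min(top_scores)
print(f"top-3 spread: {spread:.1f} points")  # top-3 spread: 2.4 points

nearing_saturation = spread < 3.0
print(f"nearing saturation: {nearing_saturation}")  # nearing saturation: True
```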

8 models have been evaluated on LongBench v2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Within that category, LongBench v2 contributes 30% of the category score, so strong performance here directly affects a model's overall ranking.
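If the two weights compose multiplicatively (an assumption on our part; BenchLM's exact aggregation is not documented here), the benchmark's effective share of the overall score works out to about 5.1%:

```python
# Effective overall weight of LongBench v2, assuming BenchLM's
# weights multiply (an assumption, not a documented formula).
category_weight = 0.17   # Reasoning category's share of the overall score
within_category = 0.30   # LongBench v2's share of the Reasoning category

effective_weight = category_weight * within_category
print(f"{effective_weight:.1%} of the overall score")  # 5.1% of the overall score
```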

About LongBench v2

Year: 2025
Tasks: Long-context tasks
Format: Extended-context retrieval and reasoning
Difficulty: Hard long-context

LongBench v2 is useful because context-window size alone is not a capability. It measures whether a model can retain, retrieve, and reason over long inputs effectively.
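Benchmarks of this kind typically score multiple-choice answers against a reference key. A minimal accuracy scorer, with invented example data (the questions and answers below are illustrative, not LongBench v2 items):

```python
# Minimal accuracy scorer for a multiple-choice long-context benchmark.
# The prediction/gold data here is invented for illustration.
def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the key."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["A", "C", "B", "D", "A"]   # hypothetical model outputs
key   = ["A", "C", "D", "D", "B"]   # hypothetical reference answers
print(f"{accuracy(preds, key):.0%}")  # 60%
```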

BenchLM freshness & provenance

Version: LongBench v2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
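The three treatment tiers can be sketched as a simple mapping from the staleness state. This is purely illustrative; BenchLM's actual policy (on its methodology page) may use different fields and rules:

```python
# Illustrative only: map a staleness state to a treatment tier.
# The state names and rules are assumptions, not BenchLM's policy.
def treatment(staleness: str) -> str:
    tiers = {
        "Current": "strong differentiator",
        "Aging": "benchmark to watch",
    }
    return tiers.get(staleness, "display-only reference")

print(treatment("Current"))  # strong differentiator
```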

Leaderboard (8 models)

| Rank | Model | Score |
|------|-------|-------|
| 1 | Claude Opus 4.5 | 64.4% |
| 2 | Qwen3.5 397B | 63.2% |
| 3 | Qwen3.6 Plus | 62% |
| 4 | | 61% |
| 5 | | 60.8% |
| 6 | | 60.6% |
| 7 | | 60.2% |
| 8 | | 59% |

FAQ

What does LongBench v2 measure?

LongBench v2 measures whether a model can actually use its extended context window for reasoning and retrieval, rather than merely accepting long inputs.

Which model scores highest on LongBench v2?

Claude Opus 4.5 by Anthropic currently leads with a score of 64.4% on LongBench v2.

How many models are evaluated on LongBench v2?

8 AI models have been evaluated on LongBench v2 on BenchLM.

Last updated: April 10, 2026 · BenchLM version LongBench v2 2025

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.