A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.
According to BenchLM.ai, GPT-5.4 Pro leads the LongBench v2 benchmark with a score of 95, tied with GPT-5.4 (95) and closely followed by Gemini 3 Pro Deep Think (94). The top models are clustered within one point, suggesting this benchmark is nearing saturation for frontier models.
121 models have been evaluated on LongBench v2. The benchmark falls in the reasoning category, which carries a 14% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
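To make the 14% category weight concrete, here is a minimal sketch of how a category-weighted overall score could be computed. Only the 14% reasoning weight comes from the text; the other category names and weights are illustrative placeholders, not BenchLM.ai's actual formula.

```python
# Hypothetical weighted-average scoring sketch. Only the 14% reasoning
# weight is stated in the source; other categories/weights are invented
# for illustration.
def overall_score(category_scores: dict, weights: dict) -> float:
    """Weighted average of per-category scores on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(category_scores[c] * weights[c] for c in weights) / total_weight

scores = {"reasoning": 95.0, "coding": 90.0, "knowledge": 88.0}
weights = {"reasoning": 0.14, "coding": 0.50, "knowledge": 0.36}

print(round(overall_score(scores, weights), 2))  # prints 89.98
```

Because the weights sum to 1.0 here, the division is a no-op; normalizing by the total weight just keeps the sketch correct if weights don't sum to one.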
Year: 2025
Tasks: Long-context tasks
Format: Extended-context retrieval and reasoning
Difficulty: Hard long-context
LongBench v2 is useful because context-window size alone is not a capability. It measures whether a model can retain, retrieve, and reason over long inputs effectively.
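The retain-retrieve-reason idea can be illustrated with a minimal "needle in a haystack" probe: bury one fact deep in filler text and check whether the model's answer recovers it. This is a generic sketch of that style of test, not LongBench v2's actual methodology; `ask_model` is a hypothetical placeholder for a model call.

```python
# Generic needle-in-a-haystack sketch (not LongBench v2's actual tasks).
def build_probe(needle: str, filler: str, n_fillers: int, depth: float) -> str:
    """Insert `needle` at a relative `depth` (0.0-1.0) inside filler text."""
    chunks = [filler] * n_fillers
    chunks.insert(int(depth * n_fillers), needle)
    return "\n".join(chunks)

def retrieved(answer: str, expected: str) -> bool:
    """Loose check: did the answer surface the buried fact?"""
    return expected.lower() in answer.lower()

prompt = build_probe(
    needle="The access code is 7291.",
    filler="The weather report repeats itself without new information.",
    n_fillers=1000,
    depth=0.5,  # bury the fact halfway through the context
)
# A real harness would append a question and query the model, e.g.:
# answer = ask_model(prompt + "\nWhat is the access code?")
```

Sweeping `depth` and `n_fillers` is what separates "has a large context window" from "can actually use it": models often degrade as the needle moves deeper into longer contexts.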