A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.
BenchLM mirrors the published score view for Video-MME. Kimi K2.5 leads the public snapshot at 87.4%. BenchLM does not use these results to rank models overall.
Year: 2024
Tasks: Video understanding
Format: Video QA and analysis
Difficulty: Broad multimodal video reasoning
BenchLM tracks the aggregate Video-MME row as a display-oriented video benchmark when providers publish a single overall score rather than separate with-subtitle and without-subtitle splits.
Version: Video-MME 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
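The tiering decision described above can be sketched as a small rule. This is a hypothetical illustration only; the function name, tier labels, and the exact mapping from freshness metadata to tiers are assumptions, not BenchLM's published policy.

```python
def benchmark_tier(staleness_state: str, display_only: bool) -> str:
    """Map freshness metadata to a display tier (illustrative sketch,
    not BenchLM's actual scoring policy)."""
    if display_only:
        # Aggregate rows tracked for display purposes never drive rankings.
        return "display-only reference"
    if staleness_state == "Fresh":
        return "strong differentiator"
    if staleness_state == "Refreshing":
        return "benchmark to watch"
    # Stale benchmarks fall back to reference status.
    return "display-only reference"

# The Video-MME aggregate row is tracked as display-oriented:
print(benchmark_tier("Refreshing", display_only=True))
```

Because the Video-MME aggregate row is display-oriented, it resolves to the reference tier regardless of its "Refreshing" staleness state.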
Kimi K2.5 by Moonshot AI currently leads with a score of 87.4% on Video-MME.
1 AI model has been evaluated on Video-MME on BenchLM.