A multi-task video understanding benchmark averaged across MLVU categories.
BenchLM mirrors the published score view for MLVU (M-Avg). Qwen3.6-27B leads the public snapshot at 86.6%, followed by Qwen3.6-35B-A3B at 86.2%. BenchLM does not use these results to rank models overall.
Qwen3.6-27B (Alibaba)
Qwen3.6-35B-A3B (Alibaba)
Year: 2026
Tasks: General video understanding
Format: Video QA and understanding
Difficulty: Broad multimodal video reasoning
MLVU captures general-purpose video understanding rather than a single narrow skill. BenchLM tracks the mean-average summary row so that scores from different providers' tables can be compared directly.
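As a rough illustration of what a mean-average summary row represents, the sketch below takes per-category scores and reduces them to one number. The category names and values are hypothetical placeholders, not MLVU results, and the unweighted mean is an assumption; the benchmark's own aggregation may weight categories differently.

```python
from statistics import mean


def category_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-category scores, rounded to one decimal.

    Assumption: every category counts equally. This is an illustrative
    choice, not a confirmed description of how M-Avg is computed.
    """
    if not scores:
        raise ValueError("need at least one category score")
    return round(mean(scores.values()), 1)


# Hypothetical per-category accuracies in percent (placeholder data).
example_scores = {
    "category_a": 80.0,
    "category_b": 90.0,
    "category_c": 85.0,
}

average = category_average(example_scores)
print(f"Summary row: {average}%")
```

Reducing each model to a single summary value is what makes side-by-side comparison possible, at the cost of hiding which categories drive the difference.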
Version: MLVU (M-Avg) 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Qwen3.6-27B by Alibaba currently leads with a score of 86.6% on MLVU (M-Avg).
Two AI models have been evaluated on MLVU (M-Avg) on BenchLM.