

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

November 10, 2025
作者: Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Ge Zhang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, Houyi Li, Wei Ji, Pengfei Wan, Wenhao Huang, Zhaoxiang Zhang, Jiaheng Liu
cs.AI

Abstract

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating multi-video understanding in MLLMs. Specifically, MVU-Eval assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, covering both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to understand content across multiple videos. The benchmark will be made publicly available to foster future research.
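The abstract does not specify the benchmark's data format or scoring protocol; as a minimal illustrative sketch (all field names here are assumptions, not from the paper), a multi-video question-answer item and per-competency accuracy scoring might look like:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MVUEvalItem:
    """Hypothetical schema for one MVU-Eval QA pair (illustrative only)."""
    question: str
    video_paths: list[str]   # multiple videos per question, the key difference
                             # from single-video benchmarks
    options: list[str]       # multiple-choice candidates
    answer: str              # ground-truth option label, e.g. "A"
    competency: str          # one of the eight core competencies (assumed label)


def accuracy_by_competency(items: list[MVUEvalItem],
                           predictions: list[str]) -> dict[str, float]:
    """Fraction of correctly answered items, grouped by competency."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.competency] += 1
        if pred == item.answer:
            correct[item.competency] += 1
    return {c: correct[c] / total[c] for c in total}


# Example usage with two hypothetical items:
items = [
    MVUEvalItem("Which car appears in both clips?",
                ["clip_a.mp4", "clip_b.mp4"], ["A", "B", "C"], "B",
                "cross-video perception"),
    MVUEvalItem("Did the same player score in both halves?",
                ["half1.mp4", "half2.mp4"], ["A", "B"], "A",
                "cross-video reasoning"),
]
scores = accuracy_by_competency(items, ["B", "B"])
```

Real per-competency reporting would follow the paper's own task taxonomy; this sketch only shows the shape of a multi-video item (several `video_paths` per question) versus a single-video one.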