MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
January 21, 2025
Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
cs.AI
Abstract
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark
for evaluating foundation models in video understanding. MMVU includes 3,000
expert-annotated questions spanning 27 subjects across four core disciplines:
Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to
prior benchmarks, MMVU features three key advancements. First, it challenges
models to apply domain-specific knowledge and perform expert-level reasoning to
analyze specialized-domain videos, moving beyond the basic visual perception
typically assessed in current video benchmarks. Second, each example is
annotated by human experts from scratch. We implement strict data quality
controls to ensure the high quality of the dataset. Finally, each example is
enriched with expert-annotated reasoning rationales and relevant domain
knowledge, facilitating in-depth analysis. We conduct an extensive evaluation
of 32 frontier multimodal foundation models on MMVU. The latest
System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest
performance among the tested models. However, they still fall short of matching
human expertise. Through in-depth error analyses and case studies, we offer
actionable insights for future advancements in expert-level,
knowledge-intensive video understanding for specialized domains.