MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
January 21, 2025
Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
cs.AI
Abstract
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark
for evaluating foundation models in video understanding. MMVU includes 3,000
expert-annotated questions spanning 27 subjects across four core disciplines:
Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to
prior benchmarks, MMVU features three key advancements. First, it challenges
models to apply domain-specific knowledge and perform expert-level reasoning to
analyze specialized-domain videos, moving beyond the basic visual perception
typically assessed in current video benchmarks. Second, each example is
annotated by human experts from scratch. We implement strict data quality
controls to ensure the high quality of the dataset. Finally, each example is
enriched with expert-annotated reasoning rationales and relevant domain
knowledge, facilitating in-depth analysis. We conduct an extensive evaluation
of 32 frontier multimodal foundation models on MMVU. The latest
System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest
performance among the tested models. However, they still fall short of matching
human expertise. Through in-depth error analyses and case studies, we offer
actionable insights for future advancements in expert-level,
knowledge-intensive video understanding for specialized domains.