MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Abstract

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch, under strict data quality controls. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
