Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
January 23, 2025
Authors: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
cs.AI
Abstract
Humans acquire knowledge through three cognitive stages: perceiving
information, comprehending knowledge, and adapting knowledge to solve novel
problems. Videos serve as an effective medium for this learning process,
facilitating a progression through these cognitive stages. However, existing
video benchmarks fail to systematically evaluate the knowledge acquisition
capabilities in Large Multimodal Models (LMMs). To address this gap, we
introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to
assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU
features a curated collection of 300 expert-level videos and 900
human-annotated questions across six disciplines, evaluating knowledge
acquisition through stage-aligned question-answer pairs: Perception,
Comprehension, and Adaptation. A proposed knowledge gain metric,
Δknowledge, quantifies improvement in performance after video viewing.
Evaluation of LMMs reveals a steep decline in performance as cognitive demands
increase and highlights a significant gap between human and model knowledge
acquisition, underscoring the need for methods to enhance LMMs' capability to
learn and adapt from videos.
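
The abstract names the knowledge gain metric but does not state its formula. As a minimal sketch, assume the gain is the accuracy improvement after watching the video, normalized by the headroom left before watching, so that a model starting near the ceiling is not penalized for having little room to improve. The function name `knowledge_gain`, the percent scale, and the handling of a perfect pre-video score are illustrative assumptions, not details taken from the text above.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain in percent (assumed definition).

    acc_before: accuracy (0-100) on the questions without the video.
    acc_after:  accuracy (0-100) on the same questions after the video.
    """
    for name, value in (("acc_before", acc_before), ("acc_after", acc_after)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")

    headroom = 100.0 - acc_before
    if headroom == 0.0:
        # Assumption: with no room left to improve, report zero gain
        # instead of dividing by zero.
        return 0.0

    # Improvement as a share of the headroom available before viewing.
    return (acc_after - acc_before) / headroom * 100.0


if __name__ == "__main__":
    # A model that moves from 40% to 55% recovers a quarter of its headroom.
    print(knowledge_gain(40.0, 55.0))
    # A model that gets worse after viewing shows a negative gain.
    print(knowledge_gain(50.0, 40.0))
```

Under this assumed definition a negative value means the video hurt performance, which a raw accuracy difference would also show, but the normalization makes gains comparable across models with different starting accuracies.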