

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

January 23, 2025
作者: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
cs.AI

Abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
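The abstract states only that Δknowledge quantifies the performance improvement after video viewing. One plausible instantiation, sketched below as an assumption rather than the paper's confirmed definition, is a normalized gain: the accuracy improvement divided by the headroom available before viewing, so a model that was already near-perfect is not penalized for small absolute gains.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Hypothetical normalized knowledge gain (an assumed form of Δknowledge).

    acc_before: accuracy on the questions before watching the video (0..1)
    acc_after:  accuracy on the same questions after watching the video (0..1)

    Returns the improvement scaled by the remaining headroom, so the
    result is 1.0 when all previously missed questions are recovered.
    """
    if acc_before >= 1.0:
        return 0.0  # no headroom left to improve
    return (acc_after - acc_before) / (1.0 - acc_before)

# Example: accuracy rises from 0.40 to 0.55 after video viewing
# → gain = 0.15 / 0.60 = 0.25
```

Note that the divisor makes the metric asymmetric by design: the same absolute improvement counts for more when the pre-viewing accuracy was already high.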
