MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
June 12, 2024
Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) demonstrate the emerging
abilities of "world models" -- interpreting and reasoning about complex
real-world dynamics. To assess these abilities, we posit videos are the ideal
medium, as they encapsulate rich representations of real-world dynamics and
causalities. To this end, we introduce MMWorld, a new benchmark for
multi-discipline, multi-faceted multimodal video understanding. MMWorld
distinguishes itself from previous video understanding benchmarks with two
unique advantages: (1) multi-discipline, covering various disciplines that
often require domain expertise for comprehensive understanding; (2)
multi-faceted reasoning, including explanation, counterfactual thinking, future
prediction, etc. MMWorld consists of a human-annotated dataset to evaluate
MLLMs with questions about whole videos and a synthetic dataset to analyze
MLLMs within a single modality of perception. In total, MMWorld encompasses
1,910 videos across seven broad disciplines and 69 subdisciplines, complete
with 6,627 question-answer pairs and associated captions. The evaluation
includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld
(e.g., GPT-4V performs best with only 52.3% accuracy), showing substantial room
for improvement. Further ablation studies reveal other interesting findings,
such as models having skill sets that differ from those of humans. We hope MMWorld can serve as
an essential step towards world model evaluation in videos.