MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
June 12, 2024
Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) demonstrate the emerging
abilities of "world models" -- interpreting and reasoning about complex
real-world dynamics. To assess these abilities, we posit videos are the ideal
medium, as they encapsulate rich representations of real-world dynamics and
causalities. To this end, we introduce MMWorld, a new benchmark for
multi-discipline, multi-faceted multimodal video understanding. MMWorld
distinguishes itself from previous video understanding benchmarks with two
unique advantages: (1) multi-discipline, covering various disciplines that
often require domain expertise for comprehensive understanding; (2)
multi-faceted reasoning, including explanation, counterfactual thinking, future
prediction, etc. MMWorld consists of a human-annotated dataset to evaluate
MLLMs with questions about whole videos and a synthetic dataset to analyze
MLLMs within a single modality of perception. In total, MMWorld encompasses
1,910 videos across seven broad disciplines and 69 subdisciplines, complete
with 6,627 question-answer pairs and associated captions. The evaluation
includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld
(e.g., GPT-4V performs the best with only 52.3% accuracy), leaving substantial room
for improvement. Further ablation studies reveal other interesting findings,
such as models having different skill sets from humans. We hope MMWorld can serve as
an essential step towards world model evaluation in videos.