Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
October 3, 2024
Authors: Jianrui Zhang, Mu Cai, Yong Jae Lee
cs.AI
Abstract
There has been growing sentiment recently that modern large multimodal models
(LMMs) have addressed most of the key challenges related to short video
comprehension. As a result, both academia and industry are gradually shifting
their attention towards the more complex challenges posed by understanding
long-form videos. However, is this really the case? Our studies indicate that
LMMs still lack many fundamental reasoning capabilities even when dealing with
short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation
benchmark encompassing 1000 short and natural video-caption pairs. We
demonstrate that existing LMMs severely struggle to distinguish temporal
differences between different actions and object transformations. For example,
the best model, GPT-4o, obtains only ~50% on our text and video scores, a
large gap compared to the human baseline of ~90%. All open-source multimodal
models and CLIP-based models perform much worse, mostly at the level of random
chance. Through this work, we shed light on the fact that temporal
reasoning in short videos is a problem yet to be fully solved. The dataset and
evaluation code are available at https://vinoground.github.io.
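The abstract does not spell out how the text and video scores are computed. As a rough illustration only, the sketch below assumes a Winoground-style pairing: each counterfactual pair holds two videos and two captions, the text score credits a pair only when the matching caption is chosen for both videos, and the video score credits a pair only when the matching video is chosen for both captions. The `PairResult` record, the `paired_scores` function, and the combined "group" score are hypothetical names introduced here, not taken from the released evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PairResult:
    """Hypothetical record of a model's answers on one counterfactual pair."""
    # For each of the two videos: did the model pick the matching caption?
    text_correct: Tuple[bool, bool]
    # For each of the two captions: did the model pick the matching video?
    video_correct: Tuple[bool, bool]


def paired_scores(results: List[PairResult]) -> Dict[str, float]:
    """Compute pair-level accuracies under the assumed Winoground-style rule.

    A pair counts toward a score only if BOTH of its questions of that type
    are answered correctly (an assumption, not the paper's stated definition).
    """
    if not results:
        raise ValueError("results must contain at least one pair")
    n = len(results)
    text = sum(all(r.text_correct) for r in results) / n
    video = sum(all(r.video_correct) for r in results) / n
    group = sum(all(r.text_correct) and all(r.video_correct) for r in results) / n
    return {"text": text, "video": video, "group": group}


# Toy data: four pairs with made-up outcomes, for illustration only.
example = [
    PairResult((True, True), (True, True)),
    PairResult((True, True), (True, False)),
    PairResult((True, False), (True, True)),
    PairResult((False, False), (False, True)),
]
scores = paired_scores(example)
```

Under this assumed rule, a model that guesses uniformly between the two options answers both questions of a pair correctly with probability 1/2 × 1/2 = 1/4, so chance level on the text or video score would be 25% rather than 50%.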