Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
July 17, 2024
作者: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny
cs.AI
Abstract
Most current LLM-based models for video understanding can only process videos
a few minutes long. They struggle with longer videos due to challenges such as
"noise and redundancy", as well as "memory and computation" constraints. In
this paper, we present Goldfish, a methodology tailored for comprehending
videos of arbitrary length. We also introduce the TVQA-long benchmark,
specifically designed to evaluate models' capabilities in understanding long
videos with questions about both visual and textual content. Goldfish
approaches these challenges with an efficient retrieval mechanism that first
gathers the top-k video clips relevant to the instruction and then generates
the desired response. This retrieval design enables Goldfish to efficiently
process arbitrarily long video sequences, facilitating its application to
content such as movies or television series. To support the retrieval process,
we developed MiniGPT4-Video, which generates detailed descriptions for the
video clips. To address the scarcity of benchmarks for long-video evaluation,
we adapted the TVQA short-video benchmark for extended content analysis by
aggregating questions from entire episodes, thereby shifting the evaluation
from partial to full episode comprehension. We attained a 41.78% accuracy rate
on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our
MiniGPT4-Video also shows exceptional performance in short-video
comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%,
16.5%, and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short-video benchmarks,
respectively. These results indicate that our models achieve significant
improvements in both long- and short-video understanding. Our models and code
are publicly available at https://vision-cair.github.io/Goldfish_website/.
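
To make the retrieve-then-answer pipeline concrete, here is a minimal Python sketch of the mechanism the abstract describes: embed per-clip textual descriptions (as produced by a captioner such as MiniGPT4-Video), rank clips by similarity to the question, and answer from only the top-k retrieved descriptions. The `embed` and `answer_llm` callables and all other names are illustrative placeholders, not the paper's actual API.

```python
import numpy as np

def retrieve_then_answer(question, clip_descriptions, embed, answer_llm, top_k=3):
    """Answer a question about a long video via top-k clip retrieval.

    embed:      text -> 1-D np.ndarray (hypothetical text encoder)
    answer_llm: (question, context) -> str (hypothetical answer model)
    """
    # Embed each clip's textual description (computed once per video)
    # and L2-normalize so dot products are cosine similarities.
    clip_vecs = np.stack([embed(d) for d in clip_descriptions])
    clip_vecs /= np.linalg.norm(clip_vecs, axis=1, keepdims=True)

    # Embed the question and score every clip against it.
    q_vec = embed(question)
    q_vec /= np.linalg.norm(q_vec)
    scores = clip_vecs @ q_vec

    # Keep only the top-k most relevant clips as context for the answerer,
    # bounding memory and compute regardless of total video length.
    top_idx = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(clip_descriptions[i] for i in top_idx)
    return answer_llm(question, context)
```

Because only k clip descriptions ever reach the language model, the cost of answering stays roughly constant as video length grows, which is the property that lets this style of pipeline scale to movies and television series.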