Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
July 17, 2024
作者: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny
cs.AI
Abstract
Most current LLM-based models for video understanding can process videos that
are only minutes long. However, they struggle with lengthy videos due to
challenges such as "noise and redundancy" and "memory and computation"
constraints. In this paper, we present Goldfish, a methodology tailored for
comprehending videos of arbitrary lengths. We also introduce the TVQA-long
benchmark, specifically designed to evaluate models' capabilities in
understanding long videos with questions about both visual and textual content.
Goldfish approaches these challenges with an efficient retrieval mechanism that
initially gathers the top-k video clips relevant to the instruction before
proceeding to provide the desired response. This retrieval design enables
Goldfish to efficiently process arbitrarily long video
sequences, facilitating its application in contexts such as movies or
television series. To facilitate the retrieval process, we developed
MiniGPT4-Video, which generates detailed descriptions of the video clips. To
address the scarcity of benchmarks for long-video evaluation, we adapted the
TVQA short video benchmark for extended content analysis by aggregating
questions from entire episodes, thereby shifting the evaluation from partial to
full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long
benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows
exceptional performance in short video comprehension, exceeding existing
state-of-the-art methods by 3.23%, 2.03%, 16.5% and 23.59% on the MSVD, MSRVTT,
TGIF, and TVQA short video benchmarks, respectively. These results indicate
that our models achieve significant improvements in both long- and short-video
understanding. Our models and code are publicly available at
https://vision-cair.github.io/Goldfish_website/
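
For a concrete picture of the pipeline, here is a minimal sketch of the
retrieval mechanism the abstract describes: each clip is captioned (the paper
uses MiniGPT4-Video for this step), the captions are embedded, and the top-k
clips most relevant to the question are retrieved before answering. The
helpers `describe_clip`, `embed`, and `generate_answer` are hypothetical
stand-ins, not the paper's actual API, and the cosine-similarity retrieval is
our assumption about the matching step.

```python
import numpy as np

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `matrix`."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def answer_long_video(clips, question, describe_clip, embed, generate_answer, k=3):
    # 1. Caption each clip; working in text space sidesteps the frame-level
    #    "noise and redundancy" of long videos.
    descriptions = [describe_clip(clip) for clip in clips]
    # 2. Embed the descriptions once, so cost grows with the number of clips
    #    rather than with raw video length.
    desc_vecs = np.stack([embed(d) for d in descriptions])
    # 3. Retrieve the top-k clips most relevant to the question.
    scores = cosine_sim(embed(question), desc_vecs)
    top_k = np.argsort(scores)[::-1][:k]
    # 4. Answer from the retrieved context only.
    context = "\n".join(descriptions[i] for i in top_k)
    return generate_answer(question, context)
```

Because the LLM only ever sees the k retrieved captions, the context length
stays fixed no matter how long the input video is, which is what makes
movie- or episode-scale inputs tractable.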