Koala: Key frame-conditioned long video-LLM
April 5, 2024
Authors: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
cs.AI
Abstract
Long video question answering is a challenging task that involves recognizing
short-term activities and reasoning about their fine-grained relationships.
State-of-the-art video Large Language Models (vLLMs) hold promise as a viable
solution due to their demonstrated emergent capabilities on new tasks. However,
despite being trained on millions of short seconds-long videos, vLLMs are
unable to understand minutes-long videos and accurately answer questions about
them. To address this limitation, we propose a lightweight and self-supervised
approach, Key frame-conditioned long video-LLM (Koala), that introduces
learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to
longer videos. Our approach introduces two new tokenizers that condition on
visual tokens computed from sparse video key frames for understanding short and
long video moments. We train our proposed approach on HowTo100M and demonstrate
its effectiveness on zero-shot long video understanding benchmarks, where it
outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across
all tasks. Surprisingly, we also empirically show that our approach not only
helps a pretrained vLLM to understand long videos but also improves its
accuracy on short-term action recognition.
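To make the abstract's description of key frame-conditioned tokenizers more concrete, below is a minimal PyTorch sketch of how learnable spatiotemporal queries might cross-attend to a segment's visual tokens conditioned on tokens from sparse key frames. The module name, dimensions, and Q-Former-style cross-attention layout are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: learnable queries attend jointly to key-frame
# tokens (global context) and per-segment tokens, producing a compact
# segment summary. Names and sizes are assumptions, not Koala's real code.
import torch
import torch.nn as nn


class KeyFrameConditionedTokenizer(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable spatiotemporal queries shared across segments.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, segment_tokens: torch.Tensor, keyframe_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens:  (B, N_seg, D) visual tokens for one short video segment
        # keyframe_tokens: (B, N_key, D) visual tokens from sparse key frames of the full video
        batch = segment_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)      # (B, Q, D)
        context = torch.cat([keyframe_tokens, segment_tokens], dim=1)  # condition on key frames
        attended, _ = self.cross_attn(queries, context, context)       # (B, Q, D)
        return attended + self.ffn(attended)                           # compact segment summary


if __name__ == "__main__":
    tokenizer = KeyFrameConditionedTokenizer()
    seg = torch.randn(2, 256, 768)   # e.g., frozen visual-encoder tokens for a segment
    key = torch.randn(2, 64, 768)    # tokens from sparse key frames
    print(tokenizer(seg, key).shape) # torch.Size([2, 32, 768])
```

In this reading, the key-frame tokens give each segment-level query a view of the whole video, so the resulting summary tokens can be concatenated across segments and passed to the frozen LLM without exceeding its context budget.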