Koala: Key frame-conditioned long video-LLM

April 5, 2024
作者: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
cs.AI

Abstract

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
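To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of one plausible reading: a set of learnable queries cross-attends over a video segment's visual tokens concatenated with tokens from sparse key frames, producing a fixed-size token set for the LLM. The class name, dimensions, query count, and the simple concatenation scheme are illustrative assumptions for exposition, not Koala's actual tokenizer design.

```python
# Illustrative sketch only: learnable queries conditioned on key-frame tokens.
# All names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn


class KeyFrameConditionedTokenizer(nn.Module):
    """Pools a segment's visual tokens into a fixed set of learnable queries,
    conditioning the attention on global key-frame tokens (hypothetical design)."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 12):
        super().__init__()
        # Learnable spatiotemporal queries (count chosen arbitrarily for the sketch).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_tokens: torch.Tensor,
                keyframe_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens:  (B, N_seg, dim) visual tokens from one short segment
        # keyframe_tokens: (B, N_key, dim) visual tokens from sparse key frames
        batch = segment_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Condition on both local segment tokens and global key-frame context.
        context = torch.cat([segment_tokens, keyframe_tokens], dim=1)
        out, _ = self.cross_attn(q, context, context)
        return self.norm(out)  # (B, num_queries, dim) tokens passed to the LLM


# Usage example with made-up shapes:
tokenizer = KeyFrameConditionedTokenizer()
segment = torch.randn(2, 256, 768)    # tokens from one video segment
keyframes = torch.randn(2, 64, 768)   # tokens from sparse key frames
print(tokenizer(segment, keyframes).shape)  # torch.Size([2, 32, 768])
```

The design choice worth noting is that the query count, not the video length, fixes the output size, which is what lets a pretrained vLLM consume arbitrarily long inputs as a bounded token budget.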
