LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
October 22, 2024
Authors: Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos
remains a significant challenge because of the limited context size of the
LLM. To address this limitation, we propose LongVU, a spatiotemporal adaptive
compression mechanism that reduces the number of video tokens while preserving
the visual details of long videos. Our idea is to leverage cross-modal queries
and inter-frame dependencies to adaptively reduce temporal and spatial
redundancy in videos. Specifically, we use DINOv2 features to remove redundant
frames that exhibit high similarity. We then use a text-guided cross-modal
query for selective frame feature reduction. Further, we perform spatial token
reduction across frames based on their temporal dependencies. Our adaptive
compression strategy effectively processes a large number of frames within a
given context length with little loss of visual information. LongVU
consistently surpasses existing methods across a variety of video understanding
benchmarks, especially on hour-long video understanding tasks such as VideoMME
and MLVU. With a lightweight LLM, LongVU also scales down effectively to a
smaller size while retaining state-of-the-art video understanding performance.
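The first compression step in the abstract, dropping frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the similarity measure, the threshold, or which frames are compared, so the cosine similarity, the 0.9 threshold, and the "compare against the most recently kept frame" rule below are all assumptions, as are the function names. The short toy vectors stand in for real DINOv2 frame features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def prune_redundant_frames(frame_features, threshold=0.9):
    """Return the indices of frames to keep (hypothetical helper).

    Assumed rule: a frame is dropped when its similarity to the most
    recently kept frame is at or above `threshold`.
    """
    kept = []
    for i, feat in enumerate(frame_features):
        if not kept or cosine_similarity(frame_features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


# Toy per-frame features (stand-ins for DINOv2 embeddings).
features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of frame 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of frame 2
    [0.0, 0.0, 1.0],
]
kept_indices = prune_redundant_frames(features, threshold=0.9)
print(kept_indices)  # [0, 2, 4]
```

Comparing each frame only against the last kept frame keeps the pass linear in the number of frames, which matters for hour-long videos; other comparison schemes, such as windowed averages, would fit the abstract's description equally well.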