LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

October 22, 2024
Authors: Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge because it is constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage a cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames within a given context length, with little loss of visual information. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size while retaining state-of-the-art video understanding performance.
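
To make the first compression step concrete, the following is a minimal sketch of similarity-based frame pruning. It is not the authors' implementation: the abstract only states that frames with highly similar DINOv2 features are removed, so the rule used here (drop a frame when its cosine similarity to the last kept frame reaches a threshold), the function names `cosine` and `drop_redundant_frames`, the `threshold` value, and the toy two-dimensional feature vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant_frames(features, threshold=0.9):
    """Return indices of frames to keep.

    Assumed rule (not taken from the paper): a frame is kept only if its
    similarity to the most recently kept frame is below `threshold`.
    """
    kept = []
    for i, feat in enumerate(features):
        if not kept or cosine(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


# Toy stand-ins for per-frame features (real DINOv2 features are much larger).
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept_indices = drop_redundant_frames(frames)
print(kept_indices)  # [0, 2]
```

In this toy input, frames 1 and 3 are near-duplicates of frames 0 and 2 respectively, so only the first frame of each pair survives. The text-guided and spatial token reduction steps described in the abstract are not covered by this sketch.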
