LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge because it is constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames within a given context length with little loss of visual information. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, LongVU also scales effectively to a smaller size while retaining state-of-the-art video understanding performance.
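
The abstract describes the three compression stages only at a high level. The following is a minimal sketch of how such stages could look, not the authors' implementation. Plain Python lists stand in for DINOv2 frame features, a text-query embedding, and per-frame token features. The function names, the use of cosine similarity, the greedy comparison against the last kept frame, the top-k selection rule, and the 0.9 thresholds are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant_frames(frame_feats, threshold=0.9):
    """Stage 1 (assumed rule): keep a frame only if its similarity to the
    last kept frame is below the threshold. Returns kept frame indices."""
    kept = []
    for i, feat in enumerate(frame_feats):
        if not kept or cosine(frame_feats[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def select_full_detail_frames(frame_feats, query_feat, k):
    """Stage 2 (assumed rule): return the indices of the k frames most similar
    to the text query. These would keep full detail; the remaining frames
    would be reduced to fewer features."""
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], query_feat),
        reverse=True,
    )
    return sorted(ranked[:k])


def prune_spatial_tokens(anchor_tokens, frame_tokens, threshold=0.9):
    """Stage 3 (assumed rule): keep only the token positions whose features
    changed relative to the same position in an anchor frame."""
    return [
        j
        for j, (a, t) in enumerate(zip(anchor_tokens, frame_tokens))
        if cosine(a, t) < threshold
    ]


# Toy data: four frames with 2-dimensional stand-in features.
feats = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
kept = drop_redundant_frames(feats)  # frames 1 and 3 are near-duplicates

# Rank the surviving frames against a hypothetical text-query embedding.
query = [0.0, 1.0]
detailed = select_full_detail_frames([feats[i] for i in kept], query, k=1)

# Two token positions per frame; only the second one changed.
changed = prune_spatial_tokens([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
```

On the toy data, `kept` holds the indices of the two visually distinct frames, `detailed` indexes into that reduced list, and `changed` lists the token positions that differ from the anchor frame. A real pipeline would operate on tensors from a vision encoder rather than Python lists.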