Visual Context Window Extension: A New Perspective for Long Video Understanding
September 30, 2024
作者: Hongchen Wei, Zhenzhong Chen
cs.AI
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance in
short video understanding tasks but face great challenges when applied to long
video understanding. In contrast, Large Language Models (LLMs) exhibit
outstanding capabilities in modeling long texts. Existing work attempts to
address this issue by introducing long video-text pairs during training.
However, these approaches require substantial computational and data resources.
In this paper, we tackle the challenge of long video understanding from the
perspective of context windows, aiming to apply LMMs to long video tasks
without retraining on long video datasets. We first conduct an in-depth
analysis of why pretrained LMMs struggle to understand lengthy video content,
identifying that discrepancies between visual and language modalities lead to
different context windows for visual and language tokens, making it difficult
to directly extend the visual tokens to match the language context window.
Based on this, we propose to adapt LMMs for long video understanding tasks by
extending the visual context window, eliminating the need for retraining on
large-scale long video datasets. To further mitigate the significant memory
consumption caused by long sequences, we introduce a progressive pooling
inference strategy that selectively adjusts the spatial resolution of frame
embeddings, reducing the number of visual tokens while retaining important
spatial information. Across multiple long video understanding benchmarks, our
method consistently improves performance as the number of video frames
increases. On the MLVU benchmark, our method outperforms GPT-4o, even though
our model size is only 7B. Additionally, in the 256-frame setting, our method
reduces memory usage by approximately 45% compared to the baseline, without
introducing any performance loss.Summary
AI-Generated Summary