Token-Efficient Long Video Understanding for Multimodal LLMs
March 6, 2025
Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
cs.AI
Abstract
Recent advances in video-based multimodal large language models (Video-LLMs)
have significantly improved video understanding by processing videos as
sequences of image frames. However, many existing methods treat frames
independently in the vision backbone, lacking explicit temporal modeling, which
limits their ability to capture dynamic patterns and efficiently handle long
videos. To address these limitations, we introduce STORM
(Spatiotemporal TOken Reduction for
Multimodal LLMs), a novel architecture incorporating a dedicated
temporal encoder between the image encoder and the LLM. Our temporal encoder
leverages the Mamba State Space Model to integrate temporal information into
image tokens, generating enriched representations that preserve inter-frame
dynamics across the entire video sequence. This enriched encoding not only
enhances video reasoning capabilities but also enables effective token
reduction strategies, including test-time sampling and training-based temporal
and spatial pooling, substantially reducing computational demands on the LLM
without sacrificing key temporal information. By integrating these techniques,
our approach simultaneously reduces training and inference latency while
improving performance, enabling efficient and robust video understanding over
extended temporal contexts. Extensive evaluations show that STORM achieves
state-of-the-art results across various long video understanding benchmarks
(more than a 5% improvement on MLVU and LongVideoBench) while reducing
computation costs by up to 8× and decoding latency by 2.4–2.9× for a fixed
number of input frames. The project page is available at
https://research.nvidia.com/labs/lpr/storm.
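To make the pipeline described above concrete, below is a minimal PyTorch sketch of the data flow the abstract outlines: image-encoder tokens pass through a temporal encoder that mixes information across frames, then through temporal pooling to shrink the token count before reaching the LLM. This is an illustrative assumption, not the authors' code: the `TemporalEncoder` here uses a bidirectional GRU purely as a stand-in for the paper's Mamba state space model, and all module names, tensor shapes, and the pooling factor are hypothetical.

```python
# Illustrative sketch of a STORM-style pipeline; not the paper's implementation.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Stand-in for the Mamba-based temporal encoder: mixes information
    across frames so each image token carries inter-frame context.
    (A bidirectional GRU is used here only as a placeholder for the SSM.)"""
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim) from the image encoder
        b, t, n, d = x.shape
        # Scan along the temporal axis for each spatial token position.
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # (b*n, t, d)
        mixed, _ = self.mixer(x)
        mixed = self.norm(mixed + x)                      # residual connection
        return mixed.reshape(b, n, t, d).permute(0, 2, 1, 3)

def temporal_pool(x: torch.Tensor, factor: int) -> torch.Tensor:
    """Temporal pooling: average every `factor` consecutive frames,
    cutting the token count fed to the LLM by `factor`x. Safe only
    because tokens were already temporally enriched upstream."""
    b, t, n, d = x.shape
    assert t % factor == 0, "frame count must be divisible by the pool factor"
    return x.reshape(b, t // factor, factor, n, d).mean(dim=2)

# Toy usage: 8 frames of 16 tokens each, pooled 4x before the LLM.
frames = torch.randn(1, 8, 16, 256)        # output of the image encoder
enriched = TemporalEncoder(256)(frames)    # inject inter-frame dynamics
llm_tokens = temporal_pool(enriched, 4)    # (1, 2, 16, 256): 4x fewer frames
print(llm_tokens.flatten(1, 2).shape)      # tokens actually given to the LLM
```

The key design point the sketch mirrors is ordering: token reduction happens only after the temporal encoder has folded inter-frame dynamics into each token, which is why pooling can discard frames without discarding the temporal information they carried.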