

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

March 14, 2025
Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
cs.AI

Abstract

State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
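
The central idea of the abstract — routing the very long video-token stream through linear-complexity Mamba-2 blocks while keeping attention for the much shorter text side — can be pictured with a small PyTorch sketch. Everything below is illustrative and not the authors' implementation: the gated linear recurrence is only a stand-in for a real Mamba-2 block, and the cross-attention wiring, module names, and dimensions are assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumptions, not VAMBA's actual code) of a hybrid
# Mamba-Transformer layer: video tokens are mixed by a linear-complexity
# recurrent block standing in for Mamba-2, while text tokens use self- and
# cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSSMBlock(nn.Module):
    """Placeholder for a Mamba-2 block: a gated exponential-moving-average
    recurrence over the sequence, costing O(T) instead of O(T^2)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.decay = nn.Parameter(torch.full((dim,), -1.0))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        residual = x
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.shape[1]):               # linear scan over the sequence
            h = a * h + (1 - a) * u[:, t]
            states.append(h)
        y = torch.stack(states, dim=1) * F.silu(gate)
        return residual + self.out_proj(y)


class HybridLayer(nn.Module):
    """One hybrid layer: linear-time mixing for the (many) video tokens,
    self- and cross-attention for the (few) text tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_mixer = LinearSSMBlock(dim)
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        video = self.video_mixer(video)                       # O(T_video)
        q = self.norm1(text)
        text = text + self.text_self(q, q, q, need_weights=False)[0]
        q = self.norm2(text)
        text = text + self.text_cross(q, video, video, need_weights=False)[0]
        return video, text


if __name__ == "__main__":
    layer = HybridLayer(dim=256)
    video_tokens = torch.randn(1, 1024, 256)   # many frames x patches (toy sizes)
    text_tokens = torch.randn(1, 32, 256)
    v, t = layer(video_tokens, text_tokens)
    print(v.shape, t.shape)
```

The point of the sketch is the cost asymmetry: the video path is traversed once with a linear scan, so only the short text sequence ever pays a quadratic attention cost — which is how the abstract's memory and speed savings on long videos would arise.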

