

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

March 14, 2025
Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
cs.AI

Abstract

State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
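
The central idea of the abstract — routing the very long video-token stream through linear-complexity Mamba-2 blocks while keeping attention for the much shorter text side — can be pictured with a small PyTorch sketch. Everything below is illustrative and not the authors' implementation: the gated linear recurrence is only a stand-in for a real Mamba-2 block, and the cross-attention wiring, module names, and dimensions are assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumptions, not VAMBA's actual code) of a hybrid
# Mamba-Transformer layer: video tokens are mixed by a linear-complexity
# recurrent block standing in for Mamba-2, while text tokens use self- and
# cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSSMBlock(nn.Module):
    """Placeholder for a Mamba-2 block: a gated exponential-moving-average
    recurrence over the sequence, costing O(T) instead of O(T^2)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.decay = nn.Parameter(torch.full((dim,), -1.0))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        residual = x
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.shape[1]):               # linear scan over the sequence
            h = a * h + (1 - a) * u[:, t]
            states.append(h)
        y = torch.stack(states, dim=1) * F.silu(gate)
        return residual + self.out_proj(y)


class HybridLayer(nn.Module):
    """One hybrid layer: linear-time mixing for the (many) video tokens,
    self- and cross-attention for the (few) text tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_mixer = LinearSSMBlock(dim)
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        video = self.video_mixer(video)                       # O(T_video)
        q = self.norm1(text)
        text = text + self.text_self(q, q, q, need_weights=False)[0]
        q = self.norm2(text)
        text = text + self.text_cross(q, video, video, need_weights=False)[0]
        return video, text


if __name__ == "__main__":
    layer = HybridLayer(dim=256)
    video_tokens = torch.randn(1, 1024, 256)   # many frames x patches (toy sizes)
    text_tokens = torch.randn(1, 32, 256)
    v, t = layer(video_tokens, text_tokens)
    print(v.shape, t.shape)
```

The point of the sketch is the cost asymmetry: the video path is traversed once with a linear scan, so only the short text sequence ever pays a quadratic attention cost — which is how the abstract's memory and speed savings on long videos would arise.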

