Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
March 14, 2025
Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
cs.AI
Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to
handle hour-long video inputs due to the quadratic complexity of the causal
self-attention operations, leading to high computational costs during training
and inference. Existing token compression-based methods reduce the number of
video tokens but often incur information loss and remain inefficient for
extremely long sequences. In this paper, we explore an orthogonal direction to
build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to
encode video tokens with linear complexity. Without any token reduction, VAMBA
can encode more than 1024 frames (640×360) on a single GPU, while
transformer-based models can only encode 256 frames. For long video inputs, VAMBA
achieves at least 50% reduction in GPU memory usage during training and
inference, and nearly doubles the speed per training step compared to
transformer-based LMMs. Our experimental results demonstrate that VAMBA
improves accuracy by 4.3% on the challenging hour-long video understanding
benchmark LVBench over prior efficient video LMMs, and maintains strong
performance on a broad spectrum of long and short video understanding tasks.
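To make the hybrid idea concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the many video tokens pass through a linear-complexity recurrent mixer, while the few text tokens keep standard causal self-attention. The class names (`HybridBlock`, `SimpleLinearScanBlock`) and the gated scan are illustrative assumptions only; the scan is a simplified stand-in for a real Mamba-2 block and this is not the actual VAMBA implementation.

```python
# Illustrative sketch only: a linear-time gated scan stands in for Mamba-2,
# and the split into video/text paths mirrors the abstract's description of
# a hybrid Mamba-Transformer, not the paper's released code.

import torch
import torch.nn as nn


class SimpleLinearScanBlock(nn.Module):
    """Toy gated linear recurrence with O(sequence length) cost."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        decay = torch.sigmoid(gate)          # per-token forget gate in (0, 1)
        state = torch.zeros_like(u[:, 0])
        outputs = []
        for t in range(u.shape[1]):          # single pass: linear in seq_len
            state = decay[:, t] * state + (1 - decay[:, t]) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class HybridBlock(nn.Module):
    """Video tokens -> linear-time scan; text tokens -> causal self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.video_mixer = SimpleLinearScanBlock(dim)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # Linear-complexity path for the (very long) video token sequence.
        video_out = video_tokens + self.video_mixer(self.norm(video_tokens))

        # Quadratic attention is kept only for the short text sequence.
        t = text_tokens.shape[1]
        causal_mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=text_tokens.device),
            diagonal=1,
        )
        q = self.norm(text_tokens)
        attn_out, _ = self.text_attn(q, q, q, attn_mask=causal_mask)
        return video_out, text_tokens + attn_out


if __name__ == "__main__":
    block = HybridBlock(dim=256)
    video = torch.randn(1, 4096, 256)  # many video tokens, linear-cost path
    text = torch.randn(1, 64, 256)     # few text tokens, attention path
    v, t = block(video, text)
    print(v.shape, t.shape)
```

The design point this sketch is meant to convey is the memory trade-off claimed in the abstract: the video path's cost grows linearly with the number of frames, so frame count can scale far beyond what full causal self-attention over all video tokens would allow on one GPU.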