TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
November 20, 2025
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
cs.AI
Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle the challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal a vision-to-text information aggregation phenomenon: information progressively flows from vision tokens to text tokens as LLM depth increases, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens, reducing the vision token count to as little as 1/64 of the original while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper remains competitive with state-of-the-art models while scaling to substantially longer frame sequences. We further analyze the attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
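The abstract does not include code or specify TransV's internal design. Purely as an illustrative sketch of the idea it describes, the snippet below shows one plausible reading of a token information transfer module: instruction tokens attend to vision tokens via cross-attention (an assumed mechanism, not confirmed by the paper), after which the vision tokens are pooled down to roughly 1/64 of their original count, matching the compression rate the abstract reports. All names and parameters (`TransVSketch`, `keep_ratio`, the dimensions) are hypothetical.

```python
# Minimal, illustrative sketch of a TransV-style token transfer module.
# The paper's actual TransV internals are not given in the abstract; the
# cross-attention transfer and pooling-based compression below are
# assumptions chosen to illustrate "transfer and compress vision tokens
# into instruction tokens".
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransVSketch(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 8,
                 keep_ratio: float = 1 / 64):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Cross-attention: instruction tokens (queries) read from vision
        # tokens (keys/values), moving visual information into the text stream.
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vision: torch.Tensor, instruction: torch.Tensor):
        # vision: (B, Nv, D) video tokens; instruction: (B, Nt, D) text tokens.
        transferred, _ = self.xattn(query=instruction, key=vision, value=vision)
        instruction = self.norm(instruction + transferred)
        # Compress vision tokens to roughly keep_ratio of their original count
        # (e.g. 1/64, the rate reported in the abstract) via average pooling.
        n_keep = max(1, int(vision.size(1) * self.keep_ratio))
        compressed = F.adaptive_avg_pool1d(
            vision.transpose(1, 2), n_keep).transpose(1, 2)
        return compressed, instruction


# Usage: at 10,000+ frames, full-length vision sequences are prohibitive for
# attention layers; after compression only ~1/64 of the vision tokens remain.
if __name__ == "__main__":
    B, Nv, Nt, D = 1, 4096, 32, 1024
    module = TransVSketch(d_model=D)
    v, t = torch.randn(B, Nv, D), torch.randn(B, Nt, D)
    v_small, t_out = module(v, t)
    print(v_small.shape, t_out.shape)  # (1, 64, 1024), (1, 32, 1024)
```

The design choice the sketch illustrates is the one the abstract motivates: once information has aggregated from vision tokens into text tokens with depth, the vision tokens are largely redundant and can be aggressively downsampled without destroying multimodal understanding.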