
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

November 20, 2025
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
cs.AI

Abstract

We introduce TimeViper, a hybrid vision-language model designed to tackle the challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal a vision-to-text information aggregation phenomenon: information progressively flows from vision tokens to text tokens with increasing LLM depth, leaving the vision tokens severely redundant. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper remains competitive with state-of-the-art models while processing substantially more frames. We further analyze the attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid-model interpretability. This work represents an initial step toward developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
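The abstract does not give implementation details for TransV, so the following is only a minimal PyTorch sketch of the general idea it describes: instruction tokens absorb information from the (redundant) vision tokens, after which the vision tokens can be dropped to shorten the sequence. The class name TokenTransfer, the cross-attention formulation, and all dimensions are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of vision-to-instruction token transfer, in the spirit
# of TransV as summarized in the abstract. Cross-attention is an assumed
# mechanism here; the paper's actual module may differ.
import torch
import torch.nn as nn

class TokenTransfer(nn.Module):
    """Compress vision tokens into instruction tokens via cross-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # Instruction tokens query the vision tokens and absorb their
        # information; the vision tokens can then be discarded, shrinking
        # the sequence that downstream hybrid layers must process.
        transferred, _ = self.attn(query=instr, key=vision, value=vision)
        return self.norm(instr + transferred)

# Usage: an hour-long video at >10,000 frames can yield enormous numbers of
# vision tokens; after the transfer, only the instruction tokens remain.
B, N_v, N_i, D = 1, 4096, 64, 1024          # batch, vision, instruction, dim
vision_tokens = torch.randn(B, N_v, D)
instr_tokens = torch.randn(B, N_i, D)
out = TokenTransfer(D)(instr_tokens, vision_tokens)
print(out.shape)                            # torch.Size([1, 64, 1024])
```

In this sketch the downstream cost drops from O(N_v + N_i) tokens per layer to O(N_i), which is the kind of reduction that would let a model scale to far more input frames.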