Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Abstract

Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNNs, 3D CNNs, and Transformers. Newly proposed state space model architectures, e.g., Mamba, show promising traits for extending their success in long-sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, we conduct a comprehensive set of studies in this work, probing the different roles Mamba can play in modeling videos and investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks, while showing promising efficiency-performance trade-offs. We hope this work can provide valuable data points and insights for future research on video understanding. Code is publicly available: https://github.com/OpenGVLab/video-mamba-suite.
