VideoMamba: State Space Model for Efficient Video Understanding
March 11, 2024
Authors: Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao
cs.AI
Abstract
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution, long-video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity in recognizing short-term actions, even those with fine-grained motion differences; (3) Superiority in long-term video understanding, showing significant advances over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark, offering a scalable and efficient solution for comprehensive video understanding. All code and models are available at https://github.com/OpenGVLab/VideoMamba.
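
To make the "linear-complexity operator" claim concrete, the sketch below shows the basic discrete state space recurrence that Mamba-style operators build on: the hidden state is updated once per token, so cost grows linearly with sequence length, unlike the quadratic cost of self-attention. This is a minimal conceptual illustration, not the official VideoMamba implementation (which uses a bidirectional, input-dependent selective scan); the function name ssm_scan and the toy matrices A, B, C are assumptions for illustration only.

import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete state space model over a 1-D input sequence x.

    x: (T,) input sequence (one scalar feature per step, for illustration)
    A: (N, N) state transition matrix
    B: (N,) input projection
    C: (N,) output projection
    Returns y: (T,) outputs, computed in O(T) sequential steps.
    """
    T = x.shape[0]
    N = A.shape[0]
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):          # one update per token: linear in sequence length
        h = A @ h + B * x[t]    # state update: h_t = A h_{t-1} + B x_t
        y[t] = C @ h            # readout: y_t = C h_t
    return y

# Toy usage: a long token sequence (e.g. flattened video patches) processed in linear time.
x = np.random.randn(4096)
A = 0.9 * np.eye(8)
B = np.ones(8)
C = np.ones(8) / 8
y = ssm_scan(x, A, B, C)
print(y.shape)  # (4096,)

Because each step only carries a fixed-size hidden state forward, memory stays constant in sequence length, which is why this family of operators scales to the long, high-resolution videos the abstract targets.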