VideoMamba: State Space Model for Efficient Video Understanding
March 11, 2024
Authors: Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao
cs.AI
Abstract
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution, long-video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity in recognizing short-term actions, even those with fine-grained motion differences; (3) Superiority in long-term video understanding, showing significant advances over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark, offering a scalable and efficient solution for comprehensive video understanding. All code and models are available at https://github.com/OpenGVLab/VideoMamba.
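
To make the "linear-complexity operator" claim concrete, the sketch below shows the basic discrete state space recurrence that Mamba-style operators build on: the hidden state is updated once per token, so cost grows linearly with sequence length, unlike the quadratic cost of self-attention. This is a minimal conceptual illustration, not the official VideoMamba implementation (which uses a bidirectional, input-dependent selective scan); the function name ssm_scan and the toy matrices A, B, C are assumptions for illustration only.

import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete state space model over a 1-D input sequence x.

    x: (T,) input sequence (one scalar feature per step, for illustration)
    A: (N, N) state transition matrix
    B: (N,) input projection
    C: (N,) output projection
    Returns y: (T,) outputs, computed in O(T) sequential steps.
    """
    T = x.shape[0]
    N = A.shape[0]
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):          # one update per token: linear in sequence length
        h = A @ h + B * x[t]    # state update: h_t = A h_{t-1} + B x_t
        y[t] = C @ h            # readout: y_t = C h_t
    return y

# Toy usage: a long token sequence (e.g. flattened video patches) processed in linear time.
x = np.random.randn(4096)
A = 0.9 * np.eye(8)
B = np.ones(8)
C = np.ones(8) / 8
y = ssm_scan(x, A, B, C)
print(y.shape)  # (4096,)

Because each step only carries a fixed-size hidden state forward, memory stays constant in sequence length, which is why this family of operators scales to the long, high-resolution videos the abstract targets.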