VideoMamba: State Space Model for Efficient Video Understanding

Abstract

To address the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for understanding high-resolution long videos. Extensive evaluations reveal four core abilities of VideoMamba: (1) scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) sensitivity for recognizing short-term actions, even those with fine-grained motion differences; (3) superiority in long-term video understanding, with significant advances over traditional feature-based models; and (4) compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding and offers a scalable, efficient solution for comprehensive video understanding. All the code and models are available at https://github.com/OpenGVLab/VideoMamba.
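
The linear-complexity claim can be illustrated with a minimal sketch. It is not the VideoMamba implementation: it assumes a scalar state and fixed parameters `a`, `b`, `c`, whereas Mamba makes its parameters input-dependent, and the names `ssm_scan` and `pairwise_interactions` are hypothetical. It only shows why a state space recurrence costs one update per token, while full self-attention scores every pair of tokens.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Assumption: a, b, c are fixed scalars chosen for illustration only.
    Each token triggers one constant-cost update, so the total work
    grows linearly with len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # fold the new token into the running state
        ys.append(c * h)   # read the output from the state
    return ys


def pairwise_interactions(num_tokens):
    """Count the token pairs scored by full self-attention (quadratic)."""
    return num_tokens * num_tokens
```

Under these assumptions, doubling the number of video tokens doubles the number of recurrence updates but quadruples the number of attention pairs, which is why the gap widens for long, high-resolution videos.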
