Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Abstract

Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNNs, 3D CNNs, and Transformers. Newly proposed state space model architectures, such as Mamba, show promising traits for extending their success in long-sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, we conduct a comprehensive set of studies in this work, probing the different roles Mamba can play in modeling videos and investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks, and show promising efficiency-performance trade-offs. We hope this work can provide valuable data points and insights for future research on video understanding. Code is available at https://github.com/OpenGVLab/video-mamba-suite.
