Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Abstract

Video understanding is one of the fundamental directions in computer vision research, and extensive effort has gone into exploring architectures such as RNNs, 3D CNNs, and Transformers. Recently proposed state space model architectures, e.g., Mamba, show promising traits for extending their success in long-sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, we conduct a comprehensive set of studies that probe the different roles Mamba can play in modeling videos and investigate the diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks, along with promising efficiency-performance trade-offs. We hope this work can provide valuable data points and insights for future research on video understanding. Code is publicly available: https://github.com/OpenGVLab/video-mamba-suite.
