VideoMamba: State Space Model for Efficient Video Understanding

Abstract

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for understanding long, high-resolution videos. Extensive evaluations reveal four core abilities of VideoMamba: (1) scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) sensitivity in recognizing short-term actions, even those with fine-grained motion differences; (3) superiority in long-term video understanding, with significant advances over traditional feature-based models; and (4) compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All code and models are available at https://github.com/OpenGVLab/VideoMamba.
