LocalMamba: Visual State Space Model with Windowed Selective Scan

Abstract

Recent advances in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks such as language understanding. Yet their application to vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches flatten spatial tokens into a 1D sequence, which fails to preserve local 2D dependencies and lengthens the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging that different network layers prefer different scan patterns, we propose a dynamic method that independently searches for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
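To make the point about token distance concrete, here is a minimal sketch in plain Python. It is not taken from the LocalMamba code base: the function names (`row_major_scan`, `local_window_scan`, `sequence_gap`), the 4x4 grid, and the 2x2 window are all assumptions chosen for illustration. It compares the token order from row-by-row flattening with an order that finishes each window before moving to the next.

```python
def row_major_scan(height, width):
    """Token order produced by flattening an H x W grid row by row."""
    return [(r, c) for r in range(height) for c in range(width)]


def local_window_scan(height, width, window):
    """Token order that visits each window x window block fully before moving on.

    Windows are visited row by row, and tokens inside a window are also
    visited row by row. Hypothetical helper for illustration only; it
    assumes height and width are divisible by window.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for wr in range(0, height, window):          # top row of each window
        for wc in range(0, width, window):       # left column of each window
            for r in range(wr, wr + window):     # rows inside the window
                for c in range(wc, wc + window): # columns inside the window
                    order.append((r, c))
    return order


def sequence_gap(order, a, b):
    """Number of sequence steps separating grid positions a and b."""
    return abs(order.index(a) - order.index(b))


if __name__ == "__main__":
    flat = row_major_scan(4, 4)
    local = local_window_scan(4, 4, 2)
    # Vertically adjacent tokens (0, 0) and (1, 0):
    print(sequence_gap(flat, (0, 0), (1, 0)))   # 4: one full row apart
    print(sequence_gap(local, (0, 0), (1, 0)))  # 2: same window
```

In this toy setting the vertical neighbors sit a full row width apart under flattening, and that gap grows with image width, whereas the windowed order keeps them within one window's worth of steps. The sketch also shows the cost: horizontally adjacent tokens that straddle a window boundary, such as (0, 1) and (0, 2), end up further apart in the windowed order than in the flattened one.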
