EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Abstract

Prior efforts in lightweight model development have mainly centered on CNN- and Transformer-based designs, yet they face persistent challenges. CNNs are adept at local feature extraction but compromise resolution, while Transformers offer global reach but raise the computational cost to O(N^2). This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs) such as Mamba have shown outstanding performance and competitiveness in tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to O(N). Inspired by this, this work explores the potential of visual state space models in lightweight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration of SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates model performance. Experimental results show that EfficientVMamba scales down computational complexity while yielding competitive results across a variety of vision tasks. For example, EfficientVMamba-S with 1.3G FLOPs improves on Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba.
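The abstract gives no details of the skip sampling step, so the following is only a minimal sketch of one plausible reading: a 2D feature map is split into strided (atrous) sub-grids, each of which still spans the whole image, and each sub-grid is flattened into a shorter 1D sequence that a scan could consume. The function names and the `rate` parameter are hypothetical and are not taken from the EfficientVMamba code; the selective scan itself is not implemented here.

```python
# Illustrative sketch only: skip (atrous) sampling of a 2D grid into strided
# sub-grids, plus the inverse merge. All names here are hypothetical.

def skip_sample(grid, rate):
    """Split an H x W grid into rate*rate sub-grids by strided sampling.

    Sub-grid (i, j) holds rows i, i+rate, ... and columns j, j+rate, ...,
    so every sub-grid covers the full image at a coarser resolution.
    """
    return {
        (i, j): [row[j::rate] for row in grid[i::rate]]
        for i in range(rate)
        for j in range(rate)
    }


def flatten(sub_grid):
    """Row-major flattening: the 1D sequence a scan would consume."""
    return [value for row in sub_grid for value in row]


def merge(sub_grids, height, width, rate):
    """Inverse of skip_sample: scatter each sub-grid back to its positions."""
    grid = [[None] * width for _ in range(height)]
    for (i, j), sub in sub_grids.items():
        for r, row in enumerate(sub):
            for c, value in enumerate(row):
                grid[i + r * rate][j + c * rate] = value
    return grid


# Toy 4 x 4 "feature map" whose entries are their own row-major indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
subs = skip_sample(grid, rate=2)
sequences = {key: flatten(sub) for key, sub in subs.items()}
```

With `rate=2`, the 16-element map yields four sequences of 4 elements each, and `merge(subs, 4, 4, 2)` restores the original grid. Each sequence is `rate * rate` times shorter than the full map while still sampling every region of it, which is the sense in which skip sampling can keep global coverage on a shorter scan.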
