EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Abstract

Prior efforts in light-weight model development have mainly centered on CNN- and Transformer-based designs, yet both face persistent challenges. CNNs are adept at local feature extraction but compromise resolution, while Transformers offer global reach but escalate computational demands to O(N^2). This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs) such as Mamba have shown outstanding performance and competitiveness in tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to O(N). Inspired by this, this work explores the potential of visual state space models in light-weight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, EfficientVMamba integrates an atrous-based selective scan approach through efficient skip sampling, forming building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates model performance. Experimental results show that EfficientVMamba scales down computational complexity while yielding competitive results across a variety of vision tasks. For example, EfficientVMamba-S with 1.3G FLOPs outperforms Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba.
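The skip-sampling idea mentioned in the abstract can be illustrated with a small sketch. It assumes a stride of 2 and represents the feature map as a plain 2D grid of token indices; the function names `skip_sample` and `merge` are hypothetical and are not taken from the authors' code. Each strided sub-grid holds N / stride^2 tokens, so a sequence model run on each one processes a shorter sequence while its samples still span the whole map. The selective scan itself is omitted here.

```python
def skip_sample(fmap, stride=2):
    """Split a 2D grid into stride*stride sub-grids by strided (atrous) sampling."""
    groups = {}
    for dy in range(stride):
        for dx in range(stride):
            # Take every `stride`-th row starting at dy, and within each
            # such row every `stride`-th column starting at dx.
            groups[(dy, dx)] = [row[dx::stride] for row in fmap[dy::stride]]
    return groups


def merge(groups, height, width, stride=2):
    """Scatter the sub-grids back to their original positions."""
    out = [[None] * width for _ in range(height)]
    for (dy, dx), sub in groups.items():
        for i, row in enumerate(sub):
            for j, value in enumerate(row):
                out[dy + i * stride][dx + j * stride] = value
    return out


H, W = 4, 4
# Toy feature map whose entries are just the token indices 0..15.
fmap = [[r * W + c for c in range(W)] for r in range(H)]

groups = skip_sample(fmap, stride=2)
# Flatten each sub-grid into the 1D sequence a scan would consume.
flattened = {k: [v for row in g for v in row] for k, g in groups.items()}
# In a real block each sequence would be processed before merging;
# here the groups are merged unchanged to show the round trip.
restored = merge(groups, H, W, stride=2)
```

In this toy case the 16 tokens are split into four sequences of four tokens each, and merging the untouched groups reproduces the original grid.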
