EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
March 15, 2024
Authors: Xiaohuan Pei, Tao Huang, Chang Xu
cs.AI
Abstract
Prior efforts in light-weight model development mainly centered on CNN and
Transformer-based designs, yet faced persistent challenges: CNNs are adept at
local feature extraction but compromise resolution, while Transformers offer
global reach but escalate computational demands to O(N^2). This ongoing trade-off
between accuracy and efficiency remains a significant hurdle. Recently, state
space models (SSMs), such as Mamba, have shown outstanding performance and
competitiveness in various tasks such as language modeling and computer vision,
while reducing the time complexity of global information extraction to
O(N). Inspired by this, this work proposes to explore the potential
of visual state space models in light-weight model design and introduce a novel
efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba
integrates an atrous-based selective scan approach via efficient skip sampling,
constituting building blocks designed to harness both global and local
representational features. Additionally, we investigate the integration between
SSM blocks and convolutions, and introduce an efficient visual state space
block combined with an additional convolution branch, which further elevates the
model performance. Experimental results show that EfficientVMamba scales down
the computational complexity while yielding competitive results across a variety
of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves
over Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet.
Code is available at: https://github.com/TerryPei/EfficientVMamba.
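The abstract describes the atrous selective scan only at a high level. As a rough illustration, the following PyTorch sketch shows the skip-sampling idea on a flattened 1-D token sequence; the function name, the `rate` parameter, and the placeholder `scan_fn` are assumptions for illustration, not the paper's exact four-direction 2-D implementation.

```python
import torch

def atrous_selective_scan(x, scan_fn, rate=2):
    """Minimal sketch of an atrous (skip-sampling) selective scan.

    x:       (B, L, C) flattened spatial tokens
    scan_fn: any 1-D sequence operator, e.g. a Mamba/S6 selective scan
    rate:    skip-sampling (dilation) rate

    Rather than scanning all L tokens at once, the sequence is split
    into `rate` interleaved subsequences; each is scanned independently
    and scattered back, so each scan covers roughly L / rate tokens.
    """
    out = torch.empty_like(x)
    for offset in range(rate):
        sub = x[:, offset::rate, :]             # every `rate`-th token
        out[:, offset::rate, :] = scan_fn(sub)  # scan the subsequence
    return out

# Usage with a trivial stand-in for the selective scan:
tokens = torch.randn(2, 196, 96)                # (B, L, C), e.g. 14x14 patches
y = atrous_selective_scan(tokens, scan_fn=lambda t: t.cumsum(dim=1), rate=2)
assert y.shape == tokens.shape
```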
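Similarly, the described integration of SSM blocks with an additional convolution branch might be organized as below. This is a hypothetical sketch under stated assumptions (the class name, the depthwise-conv layer choices, and fusion by addition are all illustrative), not the paper's exact EVSS block.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Hypothetical sketch of an SSM + convolution dual-branch block.

    A state-space (selective-scan) branch captures global context while
    a parallel depthwise-convolution branch extracts local features;
    the two outputs are fused by addition.
    """
    def __init__(self, dim, ssm_branch):
        super().__init__()
        self.ssm_branch = ssm_branch                        # e.g. an atrous selective scan module
        self.conv_branch = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise 3x3 for local features
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C) for the scan
        g = self.ssm_branch(tokens)                         # global branch
        g = g.transpose(1, 2).reshape(B, C, H, W)
        l = self.conv_branch(x)                             # local branch
        return g + l                                        # fuse global and local cues

# Usage with a trivial stand-in SSM branch:
block = DualBranchBlock(dim=96, ssm_branch=nn.Identity())
out = block(torch.randn(2, 96, 14, 14))                     # -> (2, 96, 14, 14)
```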