EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
March 15, 2024
Authors: Xiaohuan Pei, Tao Huang, Chang Xu
cs.AI
Abstract
Prior efforts in light-weight model development mainly centered on CNN and
Transformer-based designs, yet faced persistent challenges: CNNs are adept at
local feature extraction but compromise resolution, while Transformers offer
global reach but escalate computational demands to O(N^2). This ongoing trade-off
between accuracy and efficiency remains a significant hurdle. Recently, state
space models (SSMs), such as Mamba, have shown outstanding performance and
competitiveness in various tasks such as language modeling and computer vision,
while reducing the time complexity of global information extraction to
O(N). Inspired by this, this work proposes to explore the potential
of visual state space models in light-weight model design and introduce a novel
efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba
integrates an atrous-based selective scan approach via efficient skip sampling,
constituting building blocks designed to harness both global and local
representational features. Additionally, we investigate the integration between
SSM blocks and convolutions, and introduce an efficient visual state space
block combined with an additional convolution branch, which further elevates the
model performance. Experimental results show that EfficientVMamba scales down
the computational complexity while yielding competitive results across a variety
of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves
over Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet.
Code is available at: https://github.com/TerryPei/EfficientVMamba.
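The abstract describes the atrous selective scan only at a high level. As a rough illustration, the following PyTorch sketch shows the skip-sampling idea on a flattened 1-D token sequence; the function name, the `rate` parameter, and the placeholder `scan_fn` are assumptions for illustration, not the paper's exact four-direction 2-D implementation.

```python
import torch

def atrous_selective_scan(x, scan_fn, rate=2):
    """Minimal sketch of an atrous (skip-sampling) selective scan.

    x:       (B, L, C) flattened spatial tokens
    scan_fn: any 1-D sequence operator, e.g. a Mamba/S6 selective scan
    rate:    skip-sampling (dilation) rate

    Rather than scanning all L tokens at once, the sequence is split
    into `rate` interleaved subsequences; each is scanned independently
    and scattered back, so each scan covers roughly L / rate tokens.
    """
    out = torch.empty_like(x)
    for offset in range(rate):
        sub = x[:, offset::rate, :]             # every `rate`-th token
        out[:, offset::rate, :] = scan_fn(sub)  # scan the subsequence
    return out

# Usage with a trivial stand-in for the selective scan:
tokens = torch.randn(2, 196, 96)                # (B, L, C), e.g. 14x14 patches
y = atrous_selective_scan(tokens, scan_fn=lambda t: t.cumsum(dim=1), rate=2)
assert y.shape == tokens.shape
```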
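Similarly, the described integration of SSM blocks with an additional convolution branch might be organized as below. This is a hypothetical sketch under stated assumptions (the class name, the depthwise-conv layer choices, and fusion by addition are all illustrative), not the paper's exact EVSS block.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Hypothetical sketch of an SSM + convolution dual-branch block.

    A state-space (selective-scan) branch captures global context while
    a parallel depthwise-convolution branch extracts local features;
    the two outputs are fused by addition.
    """
    def __init__(self, dim, ssm_branch):
        super().__init__()
        self.ssm_branch = ssm_branch                        # e.g. an atrous selective scan module
        self.conv_branch = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise 3x3 for local features
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C) for the scan
        g = self.ssm_branch(tokens)                         # global branch
        g = g.transpose(1, 2).reshape(B, C, H, W)
        l = self.conv_branch(x)                             # local branch
        return g + l                                        # fuse global and local cues

# Usage with a trivial stand-in SSM branch:
block = DualBranchBlock(dim=96, ssm_branch=nn.Identity())
out = block(torch.randn(2, 96, 14, 14))                     # -> (2, 96, 14, 14)
```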