LocalMamba: Visual State Space Model with Windowed Selective Scan
March 14, 2024
Authors: Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu
cs.AI
Abstract
Recent advancements in state space models, notably Mamba, have demonstrated
significant progress in modeling long sequences for tasks like language
understanding. Yet, their application in vision tasks has not markedly
surpassed the performance of traditional Convolutional Neural Networks (CNNs)
and Vision Transformers (ViTs). This paper posits that the key to enhancing
Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling.
Traditional ViM approaches, which flatten spatial tokens, overlook the
preservation of local 2D dependencies, thereby elongating the distance between
adjacent tokens. We introduce a novel local scanning strategy that divides
images into distinct windows, effectively capturing local dependencies while
maintaining a global perspective. Additionally, acknowledging the varying
preferences for scan patterns across different network layers, we propose a
dynamic method to independently search for the optimal scan choices for each
layer, substantially improving performance. Extensive experiments across both
plain and hierarchical models underscore our approach's superiority in
effectively capturing image representations. For example, our model
significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs.
Code is available at: https://github.com/hunto/LocalMamba.
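A minimal sketch of the windowed (local) scan idea described in the abstract, assuming PyTorch and a square token grid; this is an illustration only, not the authors' implementation (see the linked repository for the actual code):

import torch

def local_scan_order(x: torch.Tensor, window_size: int = 2) -> torch.Tensor:
    # x: (B, H*W, C) image tokens flattened in raster (row-major) order.
    # Returns tokens reordered so that each window's tokens are contiguous,
    # shortening the 1D distance between spatially adjacent tokens
    # before the selective scan.
    B, L, C = x.shape
    H = W = int(L ** 0.5)                      # assumed square token grid
    w = window_size
    x = x.view(B, H // w, w, W // w, w, C)     # split rows/cols into windows
    x = x.permute(0, 1, 3, 2, 4, 5)            # (B, H//w, W//w, w, w, C)
    return x.reshape(B, L, C)                  # windows in raster order, tokens contiguous per window

For example, with a 4x4 grid and window_size=2, the reordered sequence visits raster indices 0, 1, 4, 5 (the top-left window) before moving to the next window, rather than the full first row 0, 1, 2, 3.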