LocalMamba: Visual State Space Model with Windowed Selective Scan

Abstract

Recent advances in state space models, notably Mamba, have shown significant progress in modeling long sequences for tasks such as language understanding. Yet their application to vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing the scan directions used for sequence modeling. Traditional ViM approaches flatten spatial tokens into a single sequence, which fails to preserve local 2D dependencies and lengthens the distance between spatially adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, because different network layers prefer different scan patterns, we propose a dynamic method that independently searches for the optimal scan choices for each layer, substantially improving performance. Extensive experiments on both plain and hierarchical models demonstrate that our approach captures image representations more effectively. For example, our model outperforms Vim-Ti by 3.1% on ImageNet at the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
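
The effect of flattening on adjacent tokens can be illustrated with a small sketch. This is not the authors' implementation (that lives in the linked repository). The function names `row_major_order`, `windowed_order`, and `sequence_gap` are hypothetical, and the sketch assumes a square window that evenly divides the token grid. It compares how far apart two vertically adjacent tokens end up in a plain row-by-row scan versus a window-by-window scan.

```python
def row_major_order(height, width):
    """Flat token indices in plain row-by-row order."""
    return list(range(height * width))


def windowed_order(height, width, window):
    """Flat token indices visited window by window.

    Tokens inside each window are visited row by row. This is an
    illustrative assumption, not the paper's exact scan definition.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for win_row in range(0, height, window):
        for win_col in range(0, width, window):
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)
    return order


def sequence_gap(order, token_a, token_b):
    """Distance between two tokens' positions in the scan sequence."""
    position = {token: i for i, token in enumerate(order)}
    return abs(position[token_a] - position[token_b])


if __name__ == "__main__":
    H = W = 4
    flat = row_major_order(H, W)
    local = windowed_order(H, W, window=2)
    # Token 0 sits at (0, 0); token 4 sits directly below it at (1, 0).
    print(sequence_gap(flat, 0, 4))   # 4
    print(sequence_gap(local, 0, 4))  # 2
```

In the row-major scan the gap between vertical neighbours equals the grid width, so it grows with image resolution. In the windowed scan it is bounded by the window width for neighbours that fall inside the same window.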
