LocalMamba: Visual State Space Model with Windowed Selective Scan

March 14, 2024
Authors: Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu
cs.AI

Abstract

Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
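
To make the point about token distance concrete, the following is a minimal sketch, not the authors' implementation (which is in the linked repository). It compares a plain row-major flatten with a windowed scan on a small token grid. The function names, the 4x4 grid, the 2x2 window, and the row-major traversal order inside and between windows are all assumptions chosen for illustration.

```python
def row_major_scan(height, width):
    """Flatten a token grid row by row (the plain flattening baseline)."""
    return [(r, c) for r in range(height) for c in range(width)]


def local_window_scan(height, width, window):
    """Visit tokens one window at a time (hypothetical illustration).

    Windows are visited in row-major order, and tokens inside each
    window are also visited in row-major order. Both choices are
    assumptions made for this sketch.
    """
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for r in range(top, min(top + window, height)):
                for c in range(left, min(left + window, width)):
                    order.append((r, c))
    return order


def sequence_gap(order, a, b):
    """Distance between two grid positions in the flattened sequence."""
    return abs(order.index(a) - order.index(b))


if __name__ == "__main__":
    flat = row_major_scan(4, 4)
    local = local_window_scan(4, 4, window=2)
    # Vertically adjacent tokens (0, 0) and (1, 0):
    print(sequence_gap(flat, (0, 0), (1, 0)))
    print(sequence_gap(local, (0, 0), (1, 0)))
```

In the row-major order, two vertically adjacent tokens are separated by a full row of the grid, whereas in the windowed order tokens that share a window stay within one window's length of each other in the sequence. Both orderings visit every token exactly once, so the windowed scan only changes the order, not the content, of the sequence.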
