VMamba: Visual State Space Model

January 18, 2024
Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
cs.AI

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs exhibit remarkable scalability with linear complexity w.r.t. image resolution, ViTs surpass them in fitting capability despite contending with quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through the incorporation of global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while enhancing computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM) to traverse the spatial domain and convert any non-causal visual image into ordered patch sequences. Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases. Source code is available at https://github.com/MzeroMiko/VMamba.
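The Cross-Scan Module described above unfolds a 2D feature map into 1D patch sequences along several traversal directions, so that a causal state space model can still aggregate context from the whole image. The snippet below is a minimal sketch of that unfolding step only; the function name `cross_scan`, the tensor layout, and the exact choice of scan orders are illustrative assumptions rather than the authors' released implementation (see the linked repository for the official code).

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a 2D feature map into four 1D patch sequences (sketch).

    Scans the map row-major and column-major, each in forward and
    reversed order, so every patch is visited from four directions.

    x: (B, C, H, W) feature map -> (B, 4, C, H*W) sequences.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                            # (B, C, H*W): left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)            # (B, C, H*W): top-to-bottom, left-to-right
    scans = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    # Flip along the sequence dimension to obtain the two reversed directions.
    return torch.cat([scans, scans.flip(-1)], dim=1)    # (B, 4, C, H*W)

if __name__ == "__main__":
    feat = torch.randn(1, 96, 14, 14)
    print(cross_scan(feat).shape)  # torch.Size([1, 4, 96, 196])
```

Each of the four sequences would then be processed by a 1D selective state space model and the outputs merged back onto the spatial grid, which is how the paper keeps a global receptive field at linear cost.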