VMamba: Visual State Space Model
January 18, 2024
Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
cs.AI
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as
the two most popular foundation models for visual representation learning.
While CNNs exhibit remarkable scalability with linear complexity w.r.t. image
resolution, ViTs surpass them in fitting capabilities despite contending with
quadratic complexity. A closer inspection reveals that ViTs achieve superior
visual modeling performance through the incorporation of global receptive
fields and dynamic weights. This observation motivates us to propose a novel
architecture that inherits these components while enhancing computational
efficiency. To this end, we draw inspiration from the recently introduced state
space model and propose the Visual State Space Model (VMamba), which achieves
linear complexity without sacrificing global receptive fields. To address the
encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM)
to traverse the spatial domain and convert any non-causal visual image into
ordered patch sequences. Extensive experimental results substantiate that VMamba
not only demonstrates promising capabilities across various visual perception
tasks, but also exhibits more pronounced advantages over established benchmarks
as the image resolution increases. Source code is available at
https://github.com/MzeroMiko/VMamba.
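
To make the cross-scan idea concrete, below is a minimal sketch of how a 2D feature map could be unrolled into four ordered patch sequences (row-major, column-major, and their reverses) and later merged back, so that every patch receives context from all four scanning directions. This is an illustrative assumption based on the abstract, not the official implementation from the linked repository; the function names, tensor shapes, and the sum-based merge are all choices made for this sketch.

```python
import torch


def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unroll a (B, C, H, W) feature map into four 1D scan sequences, (B, 4, C, H*W).

    Sketch of the cross-scan step: two forward scans (row-major and
    column-major) plus their reversed counterparts.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                              # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)              # top-to-bottom, left-to-right
    forward = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    backward = torch.flip(forward, dims=[-1])             # the two opposite directions
    return torch.cat([forward, backward], dim=1)          # (B, 4, C, H*W)


def cross_merge(seqs: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Invert the four scans and sum them back into a (B, C, H, W) map."""
    B, K, C, L = seqs.shape
    fwd = seqs[:, :2] + torch.flip(seqs[:, 2:], dims=[-1])  # undo the reversed scans
    row = fwd[:, 0].view(B, C, H, W)                         # undo row-major flattening
    col = fwd[:, 1].view(B, C, W, H).transpose(2, 3)         # undo column-major flattening
    return row + col


if __name__ == "__main__":
    # Round-trip check on a toy feature map: each of the four sequences would be
    # processed by a 1D selective state space model (linear in sequence length)
    # before merging; here we only verify the scan/merge bookkeeping.
    x = torch.randn(2, 8, 14, 14)
    seqs = cross_scan(x)                 # (2, 4, 8, 196)
    y = cross_merge(seqs, 14, 14)        # (2, 8, 14, 14), equals 4 * x here
    assert torch.allclose(y, 4 * x, atol=1e-6)
```

Because each scan is a plain 1D sequence of length H*W, a selective state space model applied along it keeps the overall cost linear in the number of patches, which is the complexity advantage the abstract contrasts with the quadratic attention of ViTs.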