VMamba: Visual State Space Model
January 18, 2024
Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
cs.AI
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as
the two most popular foundation models for visual representation learning.
While CNNs exhibit remarkable scalability with linear complexity w.r.t. image
resolution, ViTs surpass them in fitting capabilities despite contending with
quadratic complexity. A closer inspection reveals that ViTs achieve superior
visual modeling performance through the incorporation of global receptive
fields and dynamic weights. This observation motivates us to propose a novel
architecture that inherits these components while enhancing computational
efficiency. To this end, we draw inspiration from the recently introduced state
space model and propose the Visual State Space Model (VMamba), which achieves
linear complexity without sacrificing global receptive fields. To address the
encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM)
to traverse the spatial domain and convert any non-causal visual image into
ordered patch sequences. Extensive experimental results substantiate that VMamba
not only demonstrates promising capabilities across various visual perception
tasks, but also exhibits more pronounced advantages over established benchmarks
as the image resolution increases. Source code is available at
https://github.com/MzeroMiko/VMamba.
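
To make the cross-scan idea concrete, below is a minimal sketch of how a 2D feature map could be unrolled into four ordered patch sequences (row-major, column-major, and their reverses) and later merged back, so that every patch receives context from all four scanning directions. This is an illustrative assumption based on the abstract, not the official implementation from the linked repository; the function names, tensor shapes, and the sum-based merge are all choices made for this sketch.

```python
import torch


def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unroll a (B, C, H, W) feature map into four 1D scan sequences, (B, 4, C, H*W).

    Sketch of the cross-scan step: two forward scans (row-major and
    column-major) plus their reversed counterparts.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                              # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)              # top-to-bottom, left-to-right
    forward = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    backward = torch.flip(forward, dims=[-1])             # the two opposite directions
    return torch.cat([forward, backward], dim=1)          # (B, 4, C, H*W)


def cross_merge(seqs: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Invert the four scans and sum them back into a (B, C, H, W) map."""
    B, K, C, L = seqs.shape
    fwd = seqs[:, :2] + torch.flip(seqs[:, 2:], dims=[-1])  # undo the reversed scans
    row = fwd[:, 0].view(B, C, H, W)                         # undo row-major flattening
    col = fwd[:, 1].view(B, C, W, H).transpose(2, 3)         # undo column-major flattening
    return row + col


if __name__ == "__main__":
    # Round-trip check on a toy feature map: each of the four sequences would be
    # processed by a 1D selective state space model (linear in sequence length)
    # before merging; here we only verify the scan/merge bookkeeping.
    x = torch.randn(2, 8, 14, 14)
    seqs = cross_scan(x)                 # (2, 4, 8, 196)
    y = cross_merge(seqs, 14, 14)        # (2, 8, 14, 14), equals 4 * x here
    assert torch.allclose(y, 4 * x, atol=1e-6)
```

Because each scan is a plain 1D sequence of length H*W, a selective state space model applied along it keeps the overall cost linear in the number of patches, which is the complexity advantage the abstract contrasts with the quadratic attention of ViTs.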