Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
January 17, 2024
Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Recently, state space models (SSMs) with efficient hardware-aware designs,
i.e., Mamba, have shown great potential for long sequence modeling. Building
efficient and generic vision backbones purely upon SSMs is an appealing
direction. However, representing visual data is challenging for SSMs due to the
position-sensitivity of visual data and the requirement of global context for
visual understanding. In this paper, we show that the reliance of visual
representation learning on self-attention is not necessary and propose a new
generic vision backbone with bidirectional Mamba blocks (Vim), which marks the
image sequences with position embeddings and compresses the visual
representation with bidirectional state space models. On ImageNet
classification, COCO object detection, and ADE20k semantic segmentation tasks,
Vim achieves higher performance compared to well-established vision
transformers like DeiT, while also demonstrating significantly improved
computation & memory efficiency. For example, Vim is 2.8× faster than
DeiT and saves 86.8% GPU memory when performing batch inference to extract
features on images with a resolution of 1248×1248. The results
demonstrate that Vim is capable of overcoming the computation & memory
constraints on performing Transformer-style understanding for high-resolution
images and it has great potential to become the next-generation backbone for
vision foundation models. Code is available at https://github.com/hustvl/Vim.
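
The abstract describes Vim as splitting an image into patch tokens, marking them with position embeddings, and compressing the sequence with state space scans run in both directions. The sketch below is a minimal, illustrative rendering of that idea in plain NumPy; the function names (patchify, ssm_scan, bidirectional_block), the diagonal transition matrix, the sum-based fusion of the two scan directions, and all shapes are assumptions for exposition, not the authors' actual Vim block, which uses Mamba's selective, input-dependent parameters and a hardware-aware implementation.

```python
# Minimal NumPy sketch of a bidirectional SSM over image patch tokens.
# Illustrative only: shapes, initialization, and the residual/sum fusion
# are assumptions, not the released Vim architecture.
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def ssm_scan(x, A, B, C):
    """Run a simple diagonal linear state space recurrence over a token sequence.

    h_t = A * h_{t-1} + B @ x_t,  y_t = C @ h_t
    x: (L, D) tokens, A: (N,) diagonal transition, B: (N, D), C: (D, N).
    """
    L, D = x.shape
    N = A.shape[0]
    h = np.zeros(N)
    ys = np.empty((L, D))
    for t in range(L):
        h = A * h + B @ x[t]
        ys[t] = C @ h
    return ys

def bidirectional_block(tokens, params):
    """Fuse a forward and a backward SSM scan (assumed fusion: residual sum)."""
    fwd = ssm_scan(tokens, *params["forward"])
    bwd = ssm_scan(tokens[::-1], *params["backward"])[::-1]
    return tokens + fwd + bwd

# Toy usage: a 224x224 RGB image -> 196 patch tokens of dim 768 (16*16*3).
image = rng.standard_normal((224, 224, 3))
tokens = patchify(image)                              # (196, 768)
D, N = tokens.shape[1], 16
pos_embed = rng.standard_normal(tokens.shape) * 0.02  # learned in practice
tokens = tokens + pos_embed                           # mark positions

params = {
    "forward":  (rng.uniform(0.8, 0.99, N),
                 rng.standard_normal((N, D)) * 0.01,
                 rng.standard_normal((D, N)) * 0.01),
    "backward": (rng.uniform(0.8, 0.99, N),
                 rng.standard_normal((N, D)) * 0.01,
                 rng.standard_normal((D, N)) * 0.01),
}
out = bidirectional_block(tokens, params)
print(out.shape)  # (196, 768)
```

Running the same recurrence over the reversed token order and fusing both outputs is what gives each token access to context from both sides of the sequence; this is the role the bidirectional design plays for position-sensitive visual data, while Mamba's hardware-aware selective scan keeps the cost linear in sequence length.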