MambaVision: A Hybrid Mamba-Transformer Vision Backbone
July 10, 2024
Authors: Ali Hatamizadeh, Jan Kautz
cs.AI
Abstract
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision,
which is specifically tailored for vision applications. Our core contribution
includes redesigning the Mamba formulation to enhance its capability for
efficient modeling of visual features. In addition, we conduct a comprehensive
ablation study on the feasibility of integrating Vision Transformers (ViT) with
Mamba. Our results demonstrate that equipping the Mamba architecture with
several self-attention blocks at the final layers greatly improves the modeling
capacity to capture long-range spatial dependencies. Based on our findings, we
introduce a family of MambaVision models with a hierarchical architecture to
meet various design criteria. For image classification on the ImageNet-1K dataset,
MambaVision model variants achieve a new State-of-the-Art (SOTA) performance in
terms of Top-1 accuracy and image throughput. In downstream tasks such as
object detection, instance segmentation and semantic segmentation on MS COCO
and ADE20K datasets, MambaVision outperforms comparably-sized backbones and
demonstrates more favorable performance. Code:
https://github.com/NVlabs/MambaVision.
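The abstract's key architectural finding is that placing several self-attention blocks at the final layers of a Mamba-based stage improves long-range spatial modeling. A minimal sketch of that layout idea, using a hypothetical helper (the function name and structure are illustrative, not taken from the MambaVision codebase):

```python
# Hedged sketch: a hypothetical scheduler for one stage of a hybrid
# backbone, where early blocks use a Mamba-style mixer and the final
# blocks use self-attention, per the abstract's finding. All names here
# are illustrative assumptions, not the actual MambaVision API.

def hybrid_stage_schedule(depth: int, num_attention: int) -> list:
    """Return the per-block mixer type for a stage of `depth` blocks.

    The first (depth - num_attention) blocks use the Mamba-style mixer;
    the last num_attention blocks use self-attention to capture
    long-range spatial dependencies.
    """
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention

# Example: an 8-block stage whose final 4 blocks are self-attention
print(hybrid_stage_schedule(8, 4))
# → ['mamba', 'mamba', 'mamba', 'mamba',
#    'attention', 'attention', 'attention', 'attention']
```

In a full hierarchical model, one such schedule would be built per stage; the paper's ablations motivate concentrating the attention blocks at the end rather than interleaving them throughout.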