
MambaVision: A Hybrid Mamba-Transformer Vision Backbone

July 10, 2024
Authors: Ali Hatamizadeh, Jan Kautz
cs.AI

Abstract

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve a new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones and demonstrates more favorable performance. Code: https://github.com/NVlabs/MambaVision.
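The hybrid layout the abstract describes, a hierarchical backbone whose stages end with self-attention blocks rather than Mamba mixers, can be sketched schematically. This is a minimal illustration, not the authors' implementation: the stage depths, the half-and-half block split, and the restriction of attention to the last two stages are assumptions for illustration; consult the repository for the actual configurations.

```python
def build_stage(depth, use_attention_tail):
    """Return an ordered list of block names for one stage.

    Per the abstract, the final blocks of a stage become self-attention
    blocks; earlier blocks remain Mamba-style mixers. The 50/50 split
    below is an assumption for illustration.
    """
    n_attn = depth // 2 if use_attention_tail else 0
    blocks = []
    for i in range(depth):
        if i >= depth - n_attn:
            blocks.append("self_attention")
        else:
            blocks.append("mamba_mixer")
    return blocks


def build_backbone(depths=(1, 3, 8, 4)):
    """Four hierarchical stages (hypothetical depths); self-attention
    is applied only in the tails of the last two stages here."""
    return [
        build_stage(d, use_attention_tail=(stage_idx >= 2))
        for stage_idx, d in enumerate(depths)
    ]


backbone = build_backbone()
```

The key design point this sketch captures is ordering: sequential Mamba-style mixers handle most of the stage cheaply, while a few self-attention blocks at the end recover the long-range spatial dependencies the abstract highlights.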
