Segment Anything with Multiple Modalities
August 17, 2024
作者: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu
cs.AI
Abstract
Robust and accurate segmentation of scenes has become one core functionality
in various visual recognition and navigation tasks. This has inspired the
recent development of Segment Anything Model (SAM), a foundation model for
general mask segmentation. However, SAM is largely tailored for single-modal
RGB images, limiting its applicability to multi-modal data captured with
widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal
plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that
supports cross-modal and multi-modal processing for robust and enhanced
segmentation with different sensor suites. MM-SAM features two key designs,
namely, unsupervised cross-modal transfer and weakly-supervised multi-modal
fusion, enabling label-efficient and parameter-efficient adaptation toward
various sensor modalities. It addresses three main challenges: 1) adaptation
toward diverse non-RGB sensors for single-modal processing, 2) synergistic
processing of multi-modal data via sensor fusion, and 3) mask-free training for
different downstream tasks. Extensive experiments show that MM-SAM consistently
outperforms SAM by large margins, demonstrating its effectiveness and
robustness across various sensors and data modalities.
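The abstract does not spell out the architecture, but its core idea — keeping SAM's RGB image encoder frozen while training a lightweight branch for an additional sensor modality (e.g., thermal or depth) and fusing the two embeddings — can be sketched conceptually. The snippet below is an illustrative assumption, not the authors' implementation: the module names (`XModalAdapter`, `fuse_embeddings`), the gating-based fusion, and all tensor sizes are hypothetical, and a plain convolution stands in for SAM's ViT image encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XModalAdapter(nn.Module):
    """Hypothetical lightweight branch that embeds a non-RGB modality
    (e.g., single-channel thermal) into the same latent space as the
    frozen RGB encoder. All layer choices and sizes are illustrative."""
    def __init__(self, in_channels: int, embed_dim: int = 256, grid: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=16, stride=16),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )
        self.grid = grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(x)                         # (B, embed_dim, H/16, W/16)
        return F.adaptive_avg_pool2d(feats, self.grid)

def fuse_embeddings(rgb_emb: torch.Tensor, x_emb: torch.Tensor,
                    gate: torch.Tensor) -> torch.Tensor:
    """Multi-modal fusion reduced to a learnable convex combination of the
    two embeddings; MM-SAM's actual fusion module may differ."""
    w = torch.sigmoid(gate)
    return w * rgb_emb + (1.0 - w) * x_emb

# --- usage sketch ---
# Stand-in for SAM's frozen ViT image encoder (outputs a 64x64 embedding grid).
frozen_rgb_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
for p in frozen_rgb_encoder.parameters():
    p.requires_grad = False                          # SAM weights stay frozen

adapter = XModalAdapter(in_channels=1)               # only adapter + gate are trained
gate = nn.Parameter(torch.zeros(1))

rgb = torch.randn(2, 3, 1024, 1024)
thermal = torch.randn(2, 1, 1024, 1024)

rgb_emb = F.adaptive_avg_pool2d(frozen_rgb_encoder(rgb), 64)
x_emb = adapter(thermal)
fused = fuse_embeddings(rgb_emb, x_emb, gate)        # would feed SAM's mask decoder
print(fused.shape)                                   # torch.Size([2, 256, 64, 64])
```

This parameter-efficient setup mirrors the claim in the abstract that only a small number of modality-specific parameters need training, while the foundation model itself remains unchanged; how MM-SAM realizes the unsupervised cross-modal transfer and weakly-supervised fusion objectives is described in the paper, not here.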