Segment Anything with Multiple Modalities
August 17, 2024
作者: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu
cs.AI
Abstract
Robust and accurate segmentation of scenes has become one core functionality
in various visual recognition and navigation tasks. This has inspired the
recent development of Segment Anything Model (SAM), a foundation model for
general mask segmentation. However, SAM is largely tailored for single-modal
RGB images, limiting its applicability to multi-modal data captured with
widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal
plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that
supports cross-modal and multi-modal processing for robust and enhanced
segmentation with different sensor suites. MM-SAM features two key designs,
namely, unsupervised cross-modal transfer and weakly-supervised multi-modal
fusion, enabling label-efficient and parameter-efficient adaptation toward
various sensor modalities. It addresses three main challenges: 1) adaptation
toward diverse non-RGB sensors for single-modal processing, 2) synergistic
processing of multi-modal data via sensor fusion, and 3) mask-free training for
different downstream tasks. Extensive experiments show that MM-SAM consistently
outperforms SAM by large margins, demonstrating its effectiveness and
robustness across various sensors and data modalities.
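The abstract does not spell out the architecture, but its core idea — keeping SAM's RGB image encoder frozen while training a lightweight branch for an additional sensor modality (e.g., thermal or depth) and fusing the two embeddings — can be sketched conceptually. The snippet below is an illustrative assumption, not the authors' implementation: the module names (`XModalAdapter`, `fuse_embeddings`), the gating-based fusion, and all tensor sizes are hypothetical, and a plain convolution stands in for SAM's ViT image encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XModalAdapter(nn.Module):
    """Hypothetical lightweight branch that embeds a non-RGB modality
    (e.g., single-channel thermal) into the same latent space as the
    frozen RGB encoder. All layer choices and sizes are illustrative."""
    def __init__(self, in_channels: int, embed_dim: int = 256, grid: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=16, stride=16),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )
        self.grid = grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(x)                         # (B, embed_dim, H/16, W/16)
        return F.adaptive_avg_pool2d(feats, self.grid)

def fuse_embeddings(rgb_emb: torch.Tensor, x_emb: torch.Tensor,
                    gate: torch.Tensor) -> torch.Tensor:
    """Multi-modal fusion reduced to a learnable convex combination of the
    two embeddings; MM-SAM's actual fusion module may differ."""
    w = torch.sigmoid(gate)
    return w * rgb_emb + (1.0 - w) * x_emb

# --- usage sketch ---
# Stand-in for SAM's frozen ViT image encoder (outputs a 64x64 embedding grid).
frozen_rgb_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
for p in frozen_rgb_encoder.parameters():
    p.requires_grad = False                          # SAM weights stay frozen

adapter = XModalAdapter(in_channels=1)               # only adapter + gate are trained
gate = nn.Parameter(torch.zeros(1))

rgb = torch.randn(2, 3, 1024, 1024)
thermal = torch.randn(2, 1, 1024, 1024)

rgb_emb = F.adaptive_avg_pool2d(frozen_rgb_encoder(rgb), 64)
x_emb = adapter(thermal)
fused = fuse_embeddings(rgb_emb, x_emb, gate)        # would feed SAM's mask decoder
print(fused.shape)                                   # torch.Size([2, 256, 64, 64])
```

This parameter-efficient setup mirrors the claim in the abstract that only a small number of modality-specific parameters need training, while the foundation model itself remains unchanged; how MM-SAM realizes the unsupervised cross-modal transfer and weakly-supervised fusion objectives is described in the paper, not here.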