Segment Anything with Multiple Modalities

August 17, 2024
Authors: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu
cs.AI

Abstract

Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of the Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.
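
The abstract mentions label-efficient and parameter-efficient adaptation via unsupervised cross-modal transfer. As a rough illustration (not the authors' implementation), the PyTorch sketch below keeps a stand-in pretrained RGB encoder frozen and trains only a lightweight per-modality adapter with a mask-free feature-alignment objective on paired RGB/non-RGB data; all module names, shapes, and the loss function are hypothetical placeholders.

```python
# Illustrative sketch only: NOT MM-SAM's actual architecture or training code.
# It shows the general pattern implied by the abstract: freeze a pretrained
# RGB encoder and adapt a small per-modality module without segmentation masks.
import torch
import torch.nn as nn


class FrozenRGBEncoder(nn.Module):
    """Stand-in for a pretrained SAM-style image encoder, kept frozen."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patch embedding
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )
        for p in self.parameters():
            p.requires_grad = False  # parameter-efficient: RGB branch is not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)


class ModalityAdapter(nn.Module):
    """Lightweight trainable head mapping a non-RGB input (e.g. thermal, depth,
    LiDAR range image) into the frozen encoder's feature space."""

    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=16, stride=16),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def alignment_loss(f_rgb: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
    """Mask-free objective: pull paired non-RGB features toward RGB features."""
    return nn.functional.mse_loss(f_x, f_rgb)


if __name__ == "__main__":
    rgb_encoder = FrozenRGBEncoder()
    thermal_adapter = ModalityAdapter(in_channels=1)  # e.g. 1-channel thermal images
    opt = torch.optim.AdamW(thermal_adapter.parameters(), lr=1e-4)

    # One toy training step on a paired RGB + thermal batch (random data here).
    rgb = torch.randn(2, 3, 224, 224)
    thermal = torch.randn(2, 1, 224, 224)

    opt.zero_grad()
    loss = alignment_loss(rgb_encoder(rgb), thermal_adapter(thermal))
    loss.backward()
    opt.step()
    print(f"alignment loss: {loss.item():.4f}")
```

Freezing the RGB branch keeps the number of trainable parameters small, and aligning features on paired data requires no segmentation masks in the target modality, which mirrors the mask-free, label- and parameter-efficient goals stated in the abstract.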