Segment Anything with Multiple Modalities

August 17, 2024
Authors: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu
cs.AI

Abstract

Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.
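The abstract names two trainable components, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, layered on top of SAM for label- and parameter-efficient adaptation, but it does not detail their architecture. The sketch below is a hypothetical, minimal illustration of that general recipe: a small per-modality adapter routes a non-RGB input (e.g., depth or thermal) through a frozen RGB encoder, and a learned gate fuses the RGB and non-RGB embeddings before mask decoding. All module names, shapes, and the gating scheme are assumptions for illustration, not the paper's actual MM-SAM design.

```python
# Hypothetical sketch of parameter-efficient multi-modal adaptation around a
# frozen segmentation backbone (stand-ins for SAM's encoder/decoder); the
# specific modules below are illustrative assumptions, not MM-SAM itself.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects a non-RGB modality (e.g., depth, thermal) into the frozen
    encoder's expected 3-channel input space; only this adapter is trained."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 3, kernel_size=1)

    def forward(self, x):
        return self.proj(x)


class GatedFusion(nn.Module):
    """Fuses RGB and non-RGB embeddings with a learned per-channel gate."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, emb_rgb, emb_x):
        g = self.gate(torch.cat([emb_rgb, emb_x], dim=1))
        return g * emb_rgb + (1 - g) * emb_x


class MultiModalSegmenter(nn.Module):
    """Wraps a frozen image encoder / mask decoder with the two small
    trainable components above; the backbone parameters stay untouched."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 in_channels_x: int, embed_dim: int):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # keep the RGB backbone frozen
        self.adapter = ModalityAdapter(in_channels_x)
        self.fusion = GatedFusion(embed_dim)

    def forward(self, rgb, x):
        emb_rgb = self.encoder(rgb)              # frozen RGB branch
        emb_x = self.encoder(self.adapter(x))    # adapted non-RGB branch
        return self.decoder(self.fusion(emb_rgb, emb_x))


if __name__ == "__main__":
    # Dummy stand-ins for the frozen encoder/decoder, just to exercise the wrapper.
    enc = nn.Conv2d(3, 16, kernel_size=8, stride=8)
    dec = nn.Conv2d(16, 1, kernel_size=1)
    model = MultiModalSegmenter(enc, dec, in_channels_x=1, embed_dim=16)
    rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
    print(model(rgb, depth).shape)  # torch.Size([2, 1, 8, 8])
```

Under this assumed setup, only the adapter and the fusion gate receive gradients, which is one plausible way to realize the label-efficient and parameter-efficient adaptation the abstract describes.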
