Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
April 5, 2024
Authors: Zifu Wan, Yuhao Wang, Silong Yong, Pingping Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
cs.AI
Abstract
Multi-modal semantic segmentation significantly enhances AI agents'
perception and scene understanding, especially under adverse conditions like
low-light or overexposed environments. Leveraging additional modalities
(X-modality) like thermal and depth alongside traditional RGB provides
complementary information, enabling more robust and reliable segmentation. In
this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic
segmentation, utilizing the Selective Structured State Space Model, Mamba.
Unlike conventional methods that rely on CNNs, with their limited local
receptive fields, or Vision Transformers (ViTs), which offer global receptive
fields at the cost of quadratic complexity, our model achieves global
receptive field coverage with linear complexity. By employing a Siamese
encoder and introducing a novel Mamba fusion mechanism, we effectively select
essential information from different modalities. A decoder is then developed
to enhance the model's channel-wise modeling ability. Our method, Sigma, is
rigorously evaluated on both RGB-Thermal and RGB-Depth segmentation tasks,
demonstrating its superiority and marking the first successful application of
State Space Models (SSMs) in multi-modal perception tasks. Code is available at
https://github.com/zifuwan/Sigma.
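
To make the abstract's core ideas concrete, below is a minimal PyTorch sketch, not the authors' implementation: it pairs a simplified selective state-space scan (a recurrence with input-dependent parameters, giving a global receptive field at linear cost in sequence length) with a weight-shared Siamese encoder and a simple gated cross-modal fusion. All class names, dimensions, and the gating scheme here are illustrative assumptions; the actual Sigma fusion mechanism and decoder in the linked repository differ.

```python
# Minimal, self-contained sketch of the two ideas named in the abstract:
# a selective state-space (Mamba-style) scan and a Siamese encoder with
# gated fusion of RGB and X-modality features. Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveScan(nn.Module):
    """Simplified selective SSM: h_t = Abar_t * h_{t-1} + Bbar_t * x_t, y_t = C_t . h_t.

    The projections B, C and the step size dt depend on the input token, which
    is what makes the scan "selective"; the sequential recurrence gives every
    output a global receptive field at O(length) cost.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Log-parameterized diagonal state matrix A, kept negative for stability.
        self.log_a = nn.Parameter(torch.randn(d_model, d_state))
        self.to_b = nn.Linear(d_model, d_state)   # input-dependent B_t
        self.to_c = nn.Linear(d_model, d_state)   # input-dependent C_t
        self.to_dt = nn.Linear(d_model, d_model)  # per-token step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> y: (batch, length, d_model)
        bsz, length, d_model = x.shape
        dt = F.softplus(self.to_dt(x))                            # (b, l, d)
        a = -torch.exp(self.log_a)                                # (d, n)
        a_bar = torch.exp(dt.unsqueeze(-1) * a)                   # (b, l, d, n)
        b_u = (dt * x).unsqueeze(-1) * self.to_b(x).unsqueeze(2)  # (b, l, d, n)
        c = self.to_c(x)                                          # (b, l, n)
        h = x.new_zeros(bsz, d_model, a.shape[-1])
        ys = []
        for t in range(length):  # linear-time recurrent scan over tokens
            h = a_bar[:, t] * h + b_u[:, t]
            ys.append(torch.einsum("bdn,bn->bd", h, c[:, t]))
        return torch.stack(ys, dim=1)


class SiameseFusion(nn.Module):
    """Weight-shared (Siamese) scan over both modalities plus a gated mix.

    The paper's actual Mamba fusion mechanism is more elaborate; the sigmoid
    gate below is only a stand-in for selecting information across modalities.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.scan = SelectiveScan(d_model)  # one module, two inputs -> Siamese
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, rgb: torch.Tensor, x_mod: torch.Tensor) -> torch.Tensor:
        f_rgb, f_x = self.scan(rgb), self.scan(x_mod)
        g = torch.sigmoid(self.gate(torch.cat([f_rgb, f_x], dim=-1)))
        return g * f_rgb + (1.0 - g) * f_x


if __name__ == "__main__":
    fuse = SiameseFusion(d_model=64)
    rgb = torch.randn(2, 196, 64)      # e.g. 14x14 patch tokens per image
    thermal = torch.randn(2, 196, 64)  # the X-modality (thermal or depth)
    print(fuse(rgb, thermal).shape)    # torch.Size([2, 196, 64])
```

Note that the per-token loop is written for clarity; practical Mamba implementations replace it with a parallel scan or fused kernel, but the asymptotic linear complexity in sequence length is the same.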