
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

April 5, 2024
作者: Zifu Wan, Yuhao Wang, Silong Yong, Pingping Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
cs.AI

Abstract

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable segmentation. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation, utilizing the Selective Structured State Space Model, Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive field coverage with linear complexity. By employing a Siamese encoder and introducing a Mamba fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our method, Sigma, is rigorously evaluated on both RGB-Thermal and RGB-Depth segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
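At the core of Mamba-style models is a selective scan: a recurrence h_t = A_t ⊙ h_{t-1} + B_t x_t with readout y_t = C_t · h_t, where the per-step parameters A_t, B_t, C_t are input-dependent and the whole sequence is processed in time linear in its length. The snippet below is a minimal NumPy sketch of that recurrence for a single scalar input channel; the function name, shapes, and sequential loop are illustrative assumptions only, not Sigma's actual implementation, which applies such scans inside a Siamese two-branch encoder and fuses the two modality streams:

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Linear-time selective scan for one input channel.

    x: (T,)    input sequence
    A: (T, N)  per-step (input-dependent) state decay
    B: (T, N)  per-step input projection
    C: (T, N)  per-step output projection
    Returns y: (T,) with h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t @ h_t.
    """
    T, N = A.shape
    h = np.zeros(N)          # hidden state, carried across time steps
    y = np.empty(T)
    for t in range(T):
        h = A[t] * h + B[t] * x[t]   # elementwise recurrence: O(N) per step
        y[t] = C[t] @ h              # readout: project state to a scalar
    return y

# Toy example: constant decay 0.5, unit input/output projections.
y = selective_scan(
    np.array([1.0, 1.0, 1.0]),
    np.full((3, 2), 0.5),
    np.ones((3, 2)),
    np.ones((3, 2)),
)
# y[0] = 2.0, y[1] = 3.0, y[2] = 3.5
```

Because each step touches only the previous hidden state, the cost is O(T·N) rather than the O(T²) of full self-attention, which is the complexity advantage the abstract refers to; practical Mamba implementations vectorize this loop with a parallel scan.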
