VSSD: Vision Mamba with Non-Causal State Space Duality

July 26, 2024
Authors: Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu
cs.AI

Abstract

Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability to long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks, as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their application to non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose discarding the magnitude of the interactions between the hidden state and tokens while preserving their relative weights, which removes the dependence of a token's contribution on preceding tokens. Combined with multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at https://github.com/YuHengsss/VSSD.
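To make the core idea concrete, the following is a minimal NumPy sketch contrasting a simplified causal SSD recurrence with the non-causal aggregation described in the abstract, where each token contributes to a single shared hidden state weighted only by its own decay rather than a cumulative product. It is an illustration under simplifying assumptions (scalar per-token decay, single scan, no multi-scan merging), not the authors' released implementation; the function and variable names are hypothetical.

```python
# Minimal sketch of causal vs. non-causal SSD-style aggregation (illustrative only).
# Assumptions: scalar decay a_t per token, input x of shape (L, d),
# projections B, C of shape (L, n). Not the VSSD reference code.
import numpy as np

def causal_ssd(x, B, C, a):
    """Simplified causal SSD scan: h_t = a_t * h_{t-1} + B_t x_t^T, y_t = C_t h_t."""
    L, d = x.shape
    n = B.shape[1]
    h = np.zeros((n, d))
    y = np.empty_like(x)
    for t in range(L):
        h = a[t] * h + np.outer(B[t], x[t])  # contribution of earlier tokens decays cumulatively
        y[t] = C[t] @ h
    return y

def noncausal_ssd(x, B, C, a):
    """Non-causal variant: each token is weighted only by its own a_t,
    so one global hidden state is shared by every position (order-invariant)."""
    H = np.einsum('t,tn,td->nd', a, B, x)  # global hidden state
    return np.einsum('tn,nd->td', C, H)    # y_t = C_t H

# Tiny usage example with random data
L, d, n = 8, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d))
B = rng.standard_normal((L, n))
C = rng.standard_normal((L, n))
a = rng.uniform(0.5, 1.0, L)
print(causal_ssd(x, B, C, a).shape, noncausal_ssd(x, B, C, a).shape)
```

In the causal scan, a token's effect on later outputs is scaled by the product of all intervening decays; dropping that magnitude while keeping each token's own weight is what lets the per-scan results be merged across multiple scan orders into a single non-causal operator, as the abstract describes.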