MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
March 29, 2024
作者: Ali Behrouz, Michele Santacatterina, Ramin Zabih
cs.AI
Abstract
Recent advances in deep learning have mainly relied on Transformers due to
their data dependency and ability to learn at scale. The attention module in
these architectures, however, exhibits quadratic time and space complexity in
input size, limiting their scalability for long-sequence modeling. Despite
recent attempts to design efficient and effective architecture backbones for
multi-dimensional data, such as images and multivariate time series, existing
models are either data-independent or fail to allow inter- and intra-dimension
communication.
Recently, State Space Models (SSMs), and more specifically Selective State
Space Models, with efficient hardware-aware implementation, have shown
promising potential for long sequence modeling. Motivated by the success of
SSMs, we present MambaMixer, a new architecture with data-dependent weights
that uses a dual selection mechanism across tokens and channels, called
Selective Token and Channel Mixer. MambaMixer connects selective mixers using a
weighted averaging mechanism, allowing layers to have direct access to early
features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time
Series MambaMixer (TSM2) architectures based on the MambaMixer block and
explore their performance in various vision and time series forecasting tasks.
Our results underline the importance of selective mixing across both tokens and
channels. In ImageNet classification, object detection, and semantic
segmentation tasks, ViM2 achieves competitive performance with well-established
vision models and outperforms SSM-based vision models. In time series
forecasting, TSM2 achieves outstanding performance compared to state-of-the-art
methods while significantly reducing computational cost. These results show
that while Transformers, cross-channel attention, and MLPs are sufficient for
good performance in time series forecasting, none of them is necessary.
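To make the dual-selection idea concrete, the following is a minimal NumPy sketch of a MambaMixer-style block, not the paper's exact layer: a simplified selective scan (with data-dependent step sizes and input/output gates, in the spirit of selective SSMs) is applied first along the token axis and then, on the transposed input, along the channel axis, with weighted-average skip connections linking the mixers. All names, shapes, and the scalar mixing weights are illustrative assumptions.

```python
import numpy as np

def selective_scan(x, w_delta, w_b, w_c, a):
    """Simplified selective SSM scan over a (length, dim) sequence.

    The step size and the input/output gates are projected from the input
    itself, which is what makes the recurrence data-dependent (selective).
    """
    L, D = x.shape
    delta = np.log1p(np.exp(x @ w_delta))  # softplus -> positive step sizes, (L, D)
    B = x @ w_b                            # data-dependent input gates, (L, D)
    C = x @ w_c                            # data-dependent output gates, (L, D)
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(L):
        a_bar = np.exp(delta[t] * a)       # discretized state transition (a < 0)
        h = a_bar * h + delta[t] * B[t] * x[t]
        y[t] = C[t] * h
    return y

class MambaMixerBlock:
    """Illustrative dual token/channel selective mixer (hypothetical layer)."""

    def __init__(self, seq_len, dim, seed=0):
        rng = np.random.default_rng(seed)
        proj = lambda n: rng.normal(0.0, 0.1, size=(n, n))
        # Token mixer scans along the sequence axis (per-channel state).
        self.tok = (proj(dim), proj(dim), proj(dim), -np.ones(dim))
        # Channel mixer scans along the channel axis of the transposed input.
        self.ch = (proj(seq_len), proj(seq_len), proj(seq_len), -np.ones(seq_len))
        # Weighted-average skip connections (learned scalars in the paper's
        # description; fixed here for illustration).
        self.alpha, self.beta = 0.5, 0.5

    def __call__(self, x):
        t = selective_scan(x, *self.tok)            # mix across tokens
        t = self.alpha * t + (1 - self.alpha) * x   # weighted average with input
        c = selective_scan(t.T, *self.ch).T         # mix across channels
        return self.beta * c + (1 - self.beta) * t  # weighted average again

x = np.random.default_rng(1).normal(size=(16, 8))  # 16 tokens, 8 channels
out = MambaMixerBlock(seq_len=16, dim=8)(x)
print(out.shape)  # (16, 8): shape is preserved through both mixers
```

Because each scan is a linear-time recurrence over its axis, the block avoids the quadratic cost of attention while still conditioning its weights on the input.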