MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

Abstract

Recent advances in deep learning have relied mainly on Transformers because of their data dependence and their ability to learn at scale. The attention module in these architectures, however, requires time and space quadratic in the input size, which limits their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, have shown promising potential for long-sequence modeling thanks to efficient hardware-aware implementations. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, which gives layers direct access to early features. As a proof of concept, we design the Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance on various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation, ViM2 achieves performance competitive with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly improved computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
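
The idea of mixing selectively along both the token axis and the channel axis, with a weighted average giving each mixer access to earlier features, can be illustrated with a toy sketch. This is not the paper's implementation. The gated recurrence below merely stands in for a selective SSM, and every name in it (selective_scan, token_mixer, channel_mixer, weighted_average, mixer_block, alphas, betas) is hypothetical. In a real model the gates and averaging weights would be learned, whereas here they are fixed by hand.

```python
import math


def selective_scan(seq, w_gate=1.0):
    """Toy input-dependent recurrence over a 1-D sequence (assumption, not Mamba).

    h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t = sigmoid(w_gate * x_t),
    so how much past state is kept depends on the current input.
    Cost is linear in the sequence length.
    """
    h, out = 0.0, []
    for x in seq:
        a = 0.5 * (1.0 + math.tanh(0.5 * w_gate * x))  # numerically stable sigmoid
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def token_mixer(x):
    # x has shape tokens x channels; scan along the token axis, one channel at a time.
    return transpose([selective_scan(col) for col in transpose(x)])


def channel_mixer(x):
    # Scan along the channel axis, one token at a time.
    return [selective_scan(row) for row in x]


def weighted_average(tensors, weights):
    # Element-wise weighted average of equally shaped tokens x channels matrices.
    if len(tensors) != len(weights):
        raise ValueError("need one weight per tensor")
    total = sum(weights)
    rows, cols = len(tensors[0]), len(tensors[0][0])
    return [
        [sum(w * t[i][j] for w, t in zip(weights, tensors)) / total
         for j in range(cols)]
        for i in range(rows)
    ]


def mixer_block(x, earlier, alphas, betas):
    # 'earlier' holds outputs of previous blocks; alphas and betas are
    # hand-picked stand-ins for learned averaging weights.
    token_in = weighted_average(earlier + [x], alphas)
    y_tok = token_mixer(token_in)
    channel_in = weighted_average(earlier + [x, y_tok], betas)
    return channel_mixer(channel_in)


# 4 tokens, 3 channels, no earlier blocks.
x = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.0, 1.0, 1.0], [2.0, -2.0, 0.5]]
y = mixer_block(x, earlier=[], alphas=[1.0], betas=[0.5, 0.5])
print(len(y), len(y[0]))
```

Each scan visits every element once, so the sketch costs time linear in the number of tokens times channels, in contrast to the quadratic cost of attention noted in the abstract.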
