Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
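The connection between SSMs and attention-like matrix forms can be illustrated with a small numerical check. The following is a minimal sketch, not the paper's implementation: it assumes a single input channel and a scalar per-step transition `a_t`, and all names (`ssm_recurrent`, `ssm_matrix`, `a`, `B`, `C`, `x`) are hypothetical and chosen only for illustration. It computes the same output two ways: as a linear-time recurrence, and as multiplication by a lower-triangular matrix whose entries combine the products `<C_i, B_j>` with a cumulative decay.

```python
import random


def ssm_recurrent(a, B, C, x):
    """Recurrent form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    n = len(B[0])
    h = [0.0] * n  # hidden state starts at zero (assumption of this sketch)
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        ys.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return ys


def ssm_matrix(a, B, C):
    """Matrix form: M[i][j] = <C_i, B_j> * a_{j+1} * ... * a_i for j <= i, else 0."""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        decay = 1.0  # empty product when j == i
        for j in range(i, -1, -1):
            M[i][j] = decay * sum(c * b for c, b in zip(C[i], B[j]))
            decay *= a[j]  # now a_j * ... * a_i, used for column j - 1
    return M


def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


# Small random instance: sequence length T, state size N.
rng = random.Random(0)
T, N = 6, 3
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
B = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
C = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
x = [rng.gauss(0, 1) for _ in range(T)]

y_rec = ssm_recurrent(a, B, C, x)
M = ssm_matrix(a, B, C)
y_mat = matvec(M, x)

# Both forms should agree up to floating-point error.
assert all(abs(p - q) < 1e-9 for p, q in zip(y_rec, y_mat))
```

In this sketch the recurrent form costs time linear in the sequence length, while the matrix form is quadratic and resembles causally masked attention with `C` in the role of queries and `B` in the role of keys. The matrix `M` here is a small example of the kind of structured lower-triangular matrix the abstract refers to.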

