Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
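
The connection described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simplified selective SSM whose per-step transition is a single scalar a_t, and it checks numerically that the linear-time recurrence and an attention-like product with a lower-triangular (semiseparable) matrix produce the same outputs. The function names, shapes, and random inputs are illustrative assumptions.

```python
import random


def ssm_recurrent(a, B, C, x):
    """Recurrent (linear-time) form, assuming a scalar transition a_t:
    h_t = a_t * h_{t-1} + B_t * x_t,   y_t = <C_t, h_t>."""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t in range(len(x)):
        h = [a[t] * h[i] + B[t][i] * x[t] for i in range(n)]
        ys.append(sum(C[t][i] * h[i] for i in range(n)))
    return ys


def ssm_matrix(a, B, C, x):
    """Matrix (quadratic-time, attention-like) form: y = M x with
    M[t][s] = <C_t, B_s> * a_{s+1} * ... * a_t for s <= t, and 0 otherwise."""
    n = len(B[0])
    ys = []
    for t in range(len(x)):
        acc = 0.0
        for s in range(t + 1):
            decay = 1.0
            for k in range(s + 1, t + 1):
                decay *= a[k]
            score = sum(C[t][i] * B[s][i] for i in range(n))
            acc += score * decay * x[s]
        ys.append(acc)
    return ys


def demo(T=6, n=3, seed=0):
    """Run both forms on the same random inputs (hypothetical sizes)."""
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 1.0) for _ in range(T)]
    B = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(T)]
    C = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(T)]
    x = [rng.gauss(0, 1) for _ in range(T)]
    return ssm_recurrent(a, B, C, x), ssm_matrix(a, B, C, x)
```

Calling `demo()` returns two output sequences that agree up to floating-point error. In the matrix form, the term <C_t, B_s> plays a role similar to an attention score, while the product of a_k values acts as a data-dependent causal mask.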
