MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Abstract

State Space Models (SSMs) have become serious contenders in sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that, to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba over the Transformer.
