
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

January 27, 2025
Authors: Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
cs.AI

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba (MoM), a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al., arXiv:2411.04996, 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational cost. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, it reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components: joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code is available at https://github.com/Weixin-Liang/Mixture-of-Mamba.
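To make "modality-specific parameterization" concrete: every token carries a modality ID, and each projection inside the block selects weights belonging to that token's modality, so each token activates only a sparse, modality-matched slice of the parameters while the rest of the layer stays shared. Below is a minimal, hypothetical PyTorch sketch of this routing idea; the class `ModalityAwareProjection`, its argument names, and the toy dimensions are illustrative assumptions for this page, not the authors' implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn


class ModalityAwareProjection(nn.Module):
    """Sketch of modality-aware sparsity: one projection per modality.

    Tokens are routed deterministically by their modality ID, so each
    token only touches its own modality's weights (sparse activation),
    while the surrounding sequence-mixing core can remain shared.
    """

    def __init__(self, d_model: int, d_inner: int, num_modalities: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(d_model, d_inner, bias=False)
            for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); modality_ids: (batch, seq_len), int
        out = x.new_zeros(*x.shape[:-1], self.projs[0].out_features)
        for m, proj in enumerate(self.projs):
            mask = modality_ids == m          # select tokens of modality m
            if mask.any():
                out[mask] = proj(x[mask])     # apply modality-specific weights
        return out


# Toy usage: an interleaved sequence of text (0) and image (1) tokens.
x = torch.randn(2, 8, 64)
modality_ids = torch.randint(0, 2, (2, 8))
layer = ModalityAwareProjection(d_model=64, d_inner=128, num_modalities=2)
y = layer(x, modality_ids)
print(y.shape)  # torch.Size([2, 8, 128])
```

Note that because routing is determined by the token's modality rather than a learned gate, this sketch needs no load-balancing loss, and the FLOPs per token match a single dense projection; only the parameter count grows with the number of modalities.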
