TransMamba: Flexibly Switching between Transformer and Mamba
March 31, 2025
Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
cs.AI
Abstract
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Mamba, a state space model (SSM) with linear complexity, offers promising efficiency gains but suffers from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx) and can thus dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory converter that bridges Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. We also thoroughly explore TransPoint scheduling for further improvements. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
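
To make the switching idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a layer that reuses one set of projection weights as Q/K/V for attention on the tokens before a TransPoint and reinterprets them as C/B/x for a simplified SSM recurrence after it, with a memory-converter step that folds the attended prefix into the initial SSM state. The class name `TransMambaSketch`, the scalar decay, and the outer-product state summary are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the TransMamba idea described above (assumptions noted inline).
import torch
import torch.nn.functional as F


class TransMambaSketch(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Shared projections: used as (Q, K, V) before the TransPoint
        # and reinterpreted as (C, B, x) after it.
        self.w_qc = torch.nn.Linear(d_model, d_model, bias=False)
        self.w_kb = torch.nn.Linear(d_model, d_model, bias=False)
        self.w_vx = torch.nn.Linear(d_model, d_model, bias=False)
        # Simplified scalar decay for the SSM recurrence (an assumption of this sketch).
        self.log_decay = torch.nn.Parameter(torch.zeros(1))

    def forward(self, u: torch.Tensor, trans_point: int) -> torch.Tensor:
        # u: (batch, seq_len, d_model); trans_point: index where attention hands off to the SSM.
        # This sketch assumes 0 < trans_point < seq_len.
        q_or_c = self.w_qc(u)
        k_or_b = self.w_kb(u)
        v_or_x = self.w_vx(u)

        # 1) Attention over the prefix (tokens before the TransPoint).
        q, k, v = (t[:, :trans_point] for t in (q_or_c, k_or_b, v_or_x))
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # 2) Memory-converter step (assumption): summarize the attended prefix into an
        #    initial SSM state via outer products of keys and values, mirroring the
        #    linear-attention view of SSM states.
        state = torch.einsum("btk,btv->bkv", k, v)  # (batch, d_model, d_model)

        # 3) Simplified SSM recurrence over the suffix, with the shared weights
        #    reinterpreted as C (readout), B (input gate), x (input).
        decay = torch.sigmoid(self.log_decay)
        ssm_outputs = []
        for t in range(trans_point, u.size(1)):
            b_t, x_t, c_t = k_or_b[:, t], v_or_x[:, t], q_or_c[:, t]
            state = decay * state + torch.einsum("bk,bv->bkv", b_t, x_t)
            ssm_outputs.append(torch.einsum("bk,bkv->bv", c_t, state))

        ssm_out = torch.stack(ssm_outputs, dim=1) if ssm_outputs else u[:, :0]
        return torch.cat([attn_out, ssm_out], dim=1)


if __name__ == "__main__":
    layer = TransMambaSketch(d_model=64)
    out = layer(torch.randn(2, 16, 64), trans_point=8)
    print(out.shape)  # torch.Size([2, 16, 64]): 8 attention tokens + 8 SSM tokens
```

In this toy setting the memory converter is just a key-value outer-product summary of the prefix; the paper's converter operates on attention outputs and the real model also varies the TransPoint across layers and sequence lengths, which this single-layer sketch does not attempt to reproduce.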