
TransMamba: Flexibly Switching between Transformer and Mamba

March 31, 2025
作者: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
cs.AI

Abstract

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx) and can thus dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory Converter that bridges Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. TransPoint scheduling is also explored thoroughly for further improvements. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines, and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
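
The abstract describes three ingredients: projections shared between attention (QKV) and the SSM branch (CBx), a TransPoint at which a layer switches from one mechanism to the other along the sequence, and a Memory Converter that carries attention-side information into the SSM state. The sketch below is a minimal, illustrative PyTorch rendering of that idea only; the simplified diagonal-decay recurrence, the outer-product memory converter, and all names and shapes (TransMambaSketch, d_state, trans_point) are assumptions for exposition, not the authors' implementation.

# Minimal sketch of the TransMamba idea from the abstract: one set of shared
# projections is read as Q/K/V by attention and as C/B/x by an SSM-style
# recurrence, and a TransPoint splits the sequence between the two paradigms.
# The memory converter here (summing outer products of keys and values into an
# initial SSM state) is an assumed stand-in for the paper's mechanism.
import torch
import torch.nn.functional as F


class TransMambaSketch(torch.nn.Module):
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        # Shared projections: reused as Q/K/V by attention and as C/B/x by the SSM.
        self.q_or_c = torch.nn.Linear(d_model, d_state, bias=False)
        self.k_or_b = torch.nn.Linear(d_model, d_state, bias=False)
        self.v_or_x = torch.nn.Linear(d_model, d_model, bias=False)
        # Per-channel decay for the simplified SSM recurrence (assumed form).
        self.decay_logit = torch.nn.Parameter(torch.zeros(d_model))
        self.out = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor, trans_point: int) -> torch.Tensor:
        # h: (batch, seq_len, d_model); trans_point: index where the paradigm switches.
        q_c, k_b, v_x = self.q_or_c(h), self.k_or_b(h), self.v_or_x(h)

        # Transformer side: causal attention over tokens [0, trans_point).
        qa, ka, va = q_c[:, :trans_point], k_b[:, :trans_point], v_x[:, :trans_point]
        attn_out = F.scaled_dot_product_attention(qa, ka, va, is_causal=True)

        # Memory converter (assumed): fold the attention-side keys/values into an
        # initial SSM state so information flows across the TransPoint.
        state = torch.einsum("btn,btd->bnd", ka, va)  # (batch, d_state, d_model)

        # Mamba-style side: linear recurrence over tokens [trans_point, seq_len).
        decay = torch.sigmoid(self.decay_logit)  # per-channel forgetting in (0, 1)
        ssm_outs = []
        for t in range(trans_point, h.size(1)):
            b_t, c_t, x_t = k_b[:, t], q_c[:, t], v_x[:, t]
            state = decay * state + torch.einsum("bn,bd->bnd", b_t, x_t)
            ssm_outs.append(torch.einsum("bn,bnd->bd", c_t, state))
        ssm_out = torch.stack(ssm_outs, dim=1) if ssm_outs else h[:, :0]

        return self.out(torch.cat([attn_out, ssm_out], dim=1))


# Usage: process the first 8 tokens with attention, the rest with the SSM branch.
layer = TransMambaSketch(d_model=32, d_state=16)
y = layer(torch.randn(2, 12, 32), trans_point=8)
print(y.shape)  # torch.Size([2, 12, 32])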
