MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large
language models (LLMs) toward artificial general intelligence (AGI). However,
the quadratic increase in computational complexity inherent in traditional
attention mechanisms presents a prohibitive overhead. Existing approaches
either impose strongly biased structures, such as sink or window attention,
which are task-specific, or radically modify the attention mechanism into
linear approximations, whose performance in complex reasoning tasks remains
inadequately explored.
In this work, we propose a solution that adheres to the "less structure"
principle, allowing the model to determine where to attend autonomously, rather
than introducing predefined biases. We introduce Mixture of Block Attention
(MoBA), an innovative approach that applies the principles of Mixture of
Experts (MoE) to the attention mechanism. This novel architecture demonstrates
superior performance on long-context tasks while offering a key advantage: the
ability to seamlessly transition between full and sparse attention, enhancing
efficiency without the risk of compromising performance. MoBA has already been
deployed to support Kimi's long-context requests and demonstrates significant
advancements in efficient attention computation for LLMs. Our code is available
at https://github.com/MoonshotAI/MoBA.
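To make the idea described above concrete, the following is a minimal single-head PyTorch sketch of MoE-style routing over blocks of keys and values: each query scores every block (here via mean-pooled block keys), keeps its own block plus its top-k past blocks, and attends only within that selection. The block size, top-k value, pooling choice, and function name are illustrative assumptions, not the official implementation (see the linked repository).

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Illustrative block-sparse attention for one head. q, k, v: [seq_len, dim]."""
    seq_len, dim = q.shape
    block_ids = torch.arange(seq_len) // block_size           # block index of each token
    n_blocks = int(block_ids.max().item()) + 1

    # Gating scores: each query against the mean-pooled keys of every block.
    pooled = torch.zeros(n_blocks, dim, dtype=k.dtype).index_add_(0, block_ids, k)
    counts = torch.bincount(block_ids, minlength=n_blocks).clamp(min=1).unsqueeze(1)
    gate = q @ (pooled / counts).T                            # [seq_len, n_blocks]

    # Block-level causality: never route a query to a future block,
    # and always keep the query's own block.
    q_block = block_ids.unsqueeze(1)                          # [seq_len, 1]
    k_block = torch.arange(n_blocks).unsqueeze(0)             # [1, n_blocks]
    gate = gate.masked_fill(k_block > q_block, float("-inf"))
    gate = gate.scatter(1, q_block, float("inf"))

    # Top-k block selection per query (own block counts as one of the picks).
    selected = torch.zeros_like(gate, dtype=torch.bool)
    selected.scatter_(1, gate.topk(min(top_k + 1, n_blocks), dim=1).indices, True)

    # Expand block selection to a token-level mask and intersect with the causal mask.
    allow = selected[:, block_ids]                            # [seq_len, seq_len]
    causal = torch.arange(seq_len).unsqueeze(0) <= torch.arange(seq_len).unsqueeze(1)
    allow = allow & causal

    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Usage: with top_k >= n_blocks - 1 every past block is selected, so the sketch
# reduces to full causal attention, mirroring the full/sparse switch noted above.
q = k = v = torch.randn(16, 8)
out = moba_attention(q, k, v, block_size=4, top_k=2)
print(out.shape)  # torch.Size([16, 8])
```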