MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large
language models (LLMs) toward artificial general intelligence (AGI). However,
the quadratic increase in computational complexity inherent in traditional
attention mechanisms presents a prohibitive overhead. Existing approaches
either impose strongly biased structures, such as sink or window attention,
which are task-specific, or radically modify the attention mechanism into
linear approximations, whose performance in complex reasoning tasks remains
inadequately explored.
In this work, we propose a solution that adheres to the "less structure"
principle, allowing the model to determine where to attend autonomously, rather
than introducing predefined biases. We introduce Mixture of Block Attention
(MoBA), an innovative approach that applies the principles of Mixture of
Experts (MoE) to the attention mechanism. This novel architecture demonstrates
superior performance on long-context tasks while offering a key advantage: the
ability to seamlessly transition between full and sparse attention, enhancing
efficiency without the risk of compromising performance. MoBA has already been
deployed to support Kimi's long-context requests and demonstrates significant
advancements in efficient attention computation for LLMs. Our code is available
at https://github.com/MoonshotAI/MoBA.
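To make the idea described above concrete, the following is a minimal single-head PyTorch sketch of MoE-style routing over blocks of keys and values: each query scores every block (here via mean-pooled block keys), keeps its own block plus its top-k past blocks, and attends only within that selection. The block size, top-k value, pooling choice, and function name are illustrative assumptions, not the official implementation (see the linked repository).

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Illustrative block-sparse attention for one head. q, k, v: [seq_len, dim]."""
    seq_len, dim = q.shape
    block_ids = torch.arange(seq_len) // block_size           # block index of each token
    n_blocks = int(block_ids.max().item()) + 1

    # Gating scores: each query against the mean-pooled keys of every block.
    pooled = torch.zeros(n_blocks, dim, dtype=k.dtype).index_add_(0, block_ids, k)
    counts = torch.bincount(block_ids, minlength=n_blocks).clamp(min=1).unsqueeze(1)
    gate = q @ (pooled / counts).T                            # [seq_len, n_blocks]

    # Block-level causality: never route a query to a future block,
    # and always keep the query's own block.
    q_block = block_ids.unsqueeze(1)                          # [seq_len, 1]
    k_block = torch.arange(n_blocks).unsqueeze(0)             # [1, n_blocks]
    gate = gate.masked_fill(k_block > q_block, float("-inf"))
    gate = gate.scatter(1, q_block, float("inf"))

    # Top-k block selection per query (own block counts as one of the picks).
    selected = torch.zeros_like(gate, dtype=torch.bool)
    selected.scatter_(1, gate.topk(min(top_k + 1, n_blocks), dim=1).indices, True)

    # Expand block selection to a token-level mask and intersect with the causal mask.
    allow = selected[:, block_ids]                            # [seq_len, seq_len]
    causal = torch.arange(seq_len).unsqueeze(0) <= torch.arange(seq_len).unsqueeze(1)
    allow = allow & causal

    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Usage: with top_k >= n_blocks - 1 every past block is selected, so the sketch
# reduces to full causal attention, mirroring the full/sparse switch noted above.
q = k = v = torch.randn(16, 8)
out = moba_attention(q, k, v, block_size=4, top_k=2)
print(out.shape)  # torch.Size([16, 8])
```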