MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.

In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
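The abstract describes MoBA as MoE-style routing applied to attention: keys are grouped into blocks and each query is gated to a small set of blocks instead of attending to every token. The sketch below is only a rough illustration of that idea under simplifying assumptions (single head, a mean-pooled block gate, fixed block_size and top_k, and the helper name moba_style_attention are all assumptions for illustration), not the paper's implementation; see the official repository for the real code.

```python
# Minimal single-head sketch of block-gated ("MoBA-style") causal attention.
# block_size, top_k, the mean-pooled block gate, and the function name are
# illustrative assumptions, not the official implementation.

import torch
import torch.nn.functional as F


def moba_style_attention(q, k, v, block_size=4, top_k=2):
    """q, k, v: (seq_len, head_dim); seq_len is assumed divisible by block_size."""
    seq_len, head_dim = q.shape
    num_blocks = seq_len // block_size

    # Represent each key block by its mean, and score every query against it.
    k_blocks = k.view(num_blocks, block_size, head_dim).mean(dim=1)   # (B, d)
    gate = q @ k_blocks.T                                             # (S, B)

    # Block-level causality: hide blocks that start after the query's block,
    # and force the query's own block to always be selected.
    pos = torch.arange(seq_len)
    q_block = pos // block_size                                       # (S,)
    future_block = torch.arange(num_blocks)[None, :] > q_block[:, None]
    gate = gate.masked_fill(future_block, float("-inf"))
    gate.scatter_(1, q_block[:, None], float("inf"))

    # MoE-style routing: each query keeps only its top-k blocks.
    chosen = gate.topk(min(top_k, num_blocks), dim=-1).indices        # (S, k)
    block_mask = torch.zeros(seq_len, num_blocks, dtype=torch.bool)
    block_mask.scatter_(1, chosen, torch.ones_like(chosen, dtype=torch.bool))
    token_mask = block_mask.repeat_interleave(block_size, dim=1)      # (S, S)

    # Combine with an ordinary token-level causal mask, then attend.
    causal = pos[None, :] <= pos[:, None]
    scores = (q @ k.T) / head_dim ** 0.5
    scores = scores.masked_fill(~(token_mask & causal), float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    q, k, v = (torch.randn(16, 8) for _ in range(3))
    out = moba_style_attention(q, k, v, block_size=4, top_k=2)
    print(out.shape)  # torch.Size([16, 8])
```

In a block-gated design of this kind, setting top_k to the number of blocks makes every (causally visible) block available to each query, recovering ordinary full causal attention; smaller top_k values yield sparse attention. This is one way to read the abstract's claim about transitioning seamlessly between full and sparse attention.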