MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.

In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
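The abstract describes MoBA as MoE-style routing applied to attention: keys are grouped into blocks and each query is gated to a small set of blocks instead of attending to every token. The sketch below is only a rough illustration of that idea under simplifying assumptions (single head, a mean-pooled block gate, fixed block_size and top_k, and the helper name moba_style_attention are all assumptions for illustration), not the paper's implementation; see the official repository for the real code.

```python
# Minimal single-head sketch of block-gated ("MoBA-style") causal attention.
# block_size, top_k, the mean-pooled block gate, and the function name are
# illustrative assumptions, not the official implementation.

import torch
import torch.nn.functional as F


def moba_style_attention(q, k, v, block_size=4, top_k=2):
    """q, k, v: (seq_len, head_dim); seq_len is assumed divisible by block_size."""
    seq_len, head_dim = q.shape
    num_blocks = seq_len // block_size

    # Represent each key block by its mean, and score every query against it.
    k_blocks = k.view(num_blocks, block_size, head_dim).mean(dim=1)   # (B, d)
    gate = q @ k_blocks.T                                             # (S, B)

    # Block-level causality: hide blocks that start after the query's block,
    # and force the query's own block to always be selected.
    pos = torch.arange(seq_len)
    q_block = pos // block_size                                       # (S,)
    future_block = torch.arange(num_blocks)[None, :] > q_block[:, None]
    gate = gate.masked_fill(future_block, float("-inf"))
    gate.scatter_(1, q_block[:, None], float("inf"))

    # MoE-style routing: each query keeps only its top-k blocks.
    chosen = gate.topk(min(top_k, num_blocks), dim=-1).indices        # (S, k)
    block_mask = torch.zeros(seq_len, num_blocks, dtype=torch.bool)
    block_mask.scatter_(1, chosen, torch.ones_like(chosen, dtype=torch.bool))
    token_mask = block_mask.repeat_interleave(block_size, dim=1)      # (S, S)

    # Combine with an ordinary token-level causal mask, then attend.
    causal = pos[None, :] <= pos[:, None]
    scores = (q @ k.T) / head_dim ** 0.5
    scores = scores.masked_fill(~(token_mask & causal), float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    q, k, v = (torch.randn(16, 8) for _ in range(3))
    out = moba_style_attention(q, k, v, block_size=4, top_k=2)
    print(out.shape)  # torch.Size([16, 8])
```

In a block-gated design of this kind, setting top_k to the number of blocks makes every (causally visible) block available to each query, recovering ordinary full causal attention; smaller top_k values yield sparse attention. This is one way to read the abstract's claim about transitioning seamlessly between full and sparse attention.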