MoBA: Mixture of Block Attention for Long-Context LLMs

Abstract

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance on complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine autonomously where to attend rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
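To make the idea of applying an MoE-style gate to attention concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes details the abstract does not state: each key/value block is summarized by its mean key, the gate keeps the top-k blocks by dot product with the query, and only a single query is handled with no causal masking. The names `attend`, `block_sparse_attend`, `block_size`, and `top_k` are hypothetical. When `top_k` equals the number of blocks, the sketch reduces to ordinary full attention, which is one way to read the "seamless transition between full and sparse attention" described above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def block_sparse_attend(q, keys, values, block_size, top_k):
    """Attend only to the top_k key/value blocks picked by a gate (assumed design)."""
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    # Assumption: the gate scores a block by q . mean(keys in that block).
    gate = []
    for lo, hi in blocks:
        mean_key = [sum(k[i] for k in keys[lo:hi]) / (hi - lo)
                    for i in range(len(q))]
        gate.append(dot(q, mean_key))
    # Keep the top_k highest-scoring blocks, then restore positional order.
    ranked = sorted(range(len(blocks)), key=lambda b: gate[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    idx = [i for b in chosen for i in range(*blocks[b])]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen


# Toy data: 8 positions, 2-dimensional keys and values, 4 blocks of size 2.
q = [1.0, 0.0]
keys = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
values = [[float(i), float(-i)] for i in range(8)]

full_out = attend(q, keys, values)
all_blocks_out, all_blocks = block_sparse_attend(q, keys, values, block_size=2, top_k=4)
sparse_out, sparse_blocks = block_sparse_attend(q, keys, values, block_size=2, top_k=2)
print("blocks used (top_k=2):", sparse_blocks)
print("full vs. all-blocks match:", full_out == all_blocks_out)
```

In this sketch the cost of the attention step scales with `top_k * block_size` rather than with the full sequence length, at the price of one gate score per block.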
