
Knocking-Heads Attention

October 27, 2025
Authors: Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
cs.AI

Abstract

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
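
The abstract describes the mechanism only at a high level. The sketch below is one plausible PyTorch reading, not the paper's implementation: it assumes the shared, diagonally-initialized projection acts as a small head-mixing matrix per Q/K/V stream, initialized to the identity and applied channel-wise before scaled dot-product attention. The class and parameter names (`KnockingHeadsAttention`, `knock`) are illustrative; the exact parameterization follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnockingHeadsAttention(nn.Module):
    """Minimal sketch of knocking-heads attention (KHA) on top of standard MHA.

    Assumption: the "knocking" step is an H x H mixing matrix per Q/K/V stream,
    shared across channels and positions, initialized to the identity so each
    head starts out isolated and cross-head mixing is learned gradually.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Diagonal (identity) initialization preserves head-specific
        # specialization at the start of training.
        self.knock = nn.ParameterDict(
            {s: nn.Parameter(torch.eye(n_heads)) for s in ("q", "k", "v")}
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t):  # [B, T, d_model] -> [B, H, T, d_head]
            return t.view(B, T, self.h, self.d).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        # Cross-head feature-level interaction *before* attention:
        # each output head becomes a learned mixture of all input heads.
        q = torch.einsum("gh,bhtd->bgtd", self.knock["q"], q)
        k = torch.einsum("gh,bhtd->bgtd", self.knock["k"], k)
        v = torch.einsum("gh,bhtd->bgtd", self.knock["v"], v)
        # Standard per-head scaled dot-product attention, then concat + output projection.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(o.transpose(1, 2).reshape(B, T, self.h * self.d))
```

Under this reading, the extra cost is three H x H matrices per layer (a few hundred parameters for typical head counts), consistent with the abstract's claim of minimal added parameters and FLOPs; the same mixing step could be dropped into GQA or GTA by applying it over the query heads and the (smaller) set of key/value heads.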