Knocking-Heads Attention
October 27, 2025
Authors: Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
cs.AI
Abstract
Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens the capacity of each individual head, and existing attention mechanisms, whether standard MHA or variants such as grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate the outputs of isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA yields superior and more stable training dynamics and achieves better performance across downstream tasks.
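
The abstract's description lends itself to a small illustration. The sketch below is one plausible reading of the mechanism, not the paper's exact formulation: the per-head queries, keys, and values are each passed through a single projection that spans all heads and is initialized to the identity (i.e., diagonal), so heads start out independent and gradually learn to mix features, before standard scaled dot-product attention is applied. The class name KnockingHeadsMixer and the choice to mix Q, K, and V with separate mixers are assumptions made for illustration.

```python
# Illustrative sketch only: one plausible reading of the cross-head mixing
# described in the abstract. Names and placement are assumptions, not the
# paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnockingHeadsMixer(nn.Module):
    """Mixes features across heads with an identity-initialized linear map."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.proj = nn.Linear(d, d, bias=False)
        nn.init.eye_(self.proj.weight)  # diagonal init: heads start independent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_heads, seq_len, head_dim)
        b, h, t, d = x.shape
        x = x.transpose(1, 2).reshape(b, t, h * d)  # concatenate heads
        x = self.proj(x)                            # cross-head feature mixing
        return x.reshape(b, t, h, d).transpose(1, 2)


def kha_attention(q, k, v, mixer_q, mixer_k, mixer_v):
    # "Knock" the heads against each other before standard SDPA.
    q, k, v = mixer_q(q), mixer_k(k), mixer_v(v)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


if __name__ == "__main__":
    B, H, T, D = 2, 8, 16, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    mixers = [KnockingHeadsMixer(H, D) for _ in range(3)]
    out = kha_attention(q, k, v, *mixers)
    print(out.shape)  # torch.Size([2, 8, 16, 64])
```

With the identity initialization, the mixer is a no-op at the first training step, which is consistent with the abstract's claim that head-specific specialization is preserved early in training while cross-head integration is learned progressively.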