Knocking-Heads Attention

Abstract

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens the capacity of each individual head, and existing attention mechanisms, whether standard MHA or variants such as grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate the outputs of isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which lets attention heads "knock" on each other, enabling cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared with baseline attention mechanisms, KHA yields superior and more stable training dynamics and better performance across downstream tasks.
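To make the mechanism concrete, the following is a minimal sketch of one possible reading of the shared projection, not the authors' implementation. It assumes that a single d_head x d_head matrix per stream (queries, keys, values) is applied to every head before scaled dot-product attention, and that "diagonal initialization" means starting from the identity. Under that assumption the heads are coupled because they all read and update the same parameters. The names `knocking_heads_attention`, `t_q`, `t_k`, and `t_v` are hypothetical, and the toy inputs are illustrative only.

```python
import math

def identity(n):
    """Diagonal (identity) initialization for a shared n x n projection."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are (seq x d_head)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def knocking_heads_attention(q_heads, k_heads, v_heads, t_q, t_k, t_v):
    """Apply the same projections t_q, t_k, t_v to every head (assumed reading),
    run per-head attention, then concatenate head outputs per position."""
    outputs = [
        attention(matmul(q, t_q), matmul(k, t_k), matmul(v, t_v))
        for q, k, v in zip(q_heads, k_heads, v_heads)
    ]
    seq_len = len(outputs[0])
    return [sum((head[i] for head in outputs), []) for i in range(seq_len)]

# Toy input: 2 heads, sequence length 2, head dimension 2 (illustrative values).
q_heads = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
k_heads = [[[1.0, 1.0], [0.0, 2.0]], [[2.0, 0.0], [0.5, 0.5]]]
v_heads = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]

eye = identity(2)
out_identity = knocking_heads_attention(q_heads, k_heads, v_heads, eye, eye, eye)
```

With identity projections the sketch reduces to ordinary multi-head attention, which mirrors the abstract's point that diagonal initialization preserves head-specific behavior at the start of training. Once the shared matrices acquire off-diagonal entries, every head's features are mixed by the same learned transformation.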