Knocking-Heads Attention

Abstract

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens the capacity of each individual head, and existing attention mechanisms, whether standard MHA or variants such as grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate the outputs of isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which lets attention heads "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA yields better and more stable training dynamics and achieves better performance across downstream tasks.
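The mechanism described above can be illustrated with a minimal NumPy sketch. It is one possible reading of the abstract, not the authors' implementation. The abstract does not specify the shape of the shared projection, which of the queries, keys, and values it is applied to, or the values on the diagonal. The sketch assumes one `(head_dim, head_dim)` matrix each for Q, K, and V, reused by every head and initialized to the identity. The names `attention`, `knocking_heads_attention`, `t_q`, `t_k`, and `t_v` are hypothetical, and masking and the output projection are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention per head (no mask).
    # q, k, v have shape (heads, tokens, head_dim).
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

def knocking_heads_attention(q, k, v, t_q, t_k, t_v):
    # ASSUMPTION: each t_* is a single (head_dim, head_dim) matrix that is
    # shared by all heads and applied before scaled dot-product attention.
    # Broadcasting applies the same matrix to every head.
    return attention(q @ t_q, k @ t_k, v @ t_v)

rng = np.random.default_rng(0)
heads, tokens, head_dim = 4, 5, 8
q, k, v = (rng.standard_normal((heads, tokens, head_dim)) for _ in range(3))

# ASSUMPTION: "diagonal initialization" is modeled here as the identity.
t_q = np.eye(head_dim)
t_k = np.eye(head_dim)
t_v = np.eye(head_dim)

baseline = attention(q, k, v)
kha_out = knocking_heads_attention(q, k, v, t_q, t_k, t_v)

# Once the shared matrices move away from the diagonal (as training could
# make them), the same parameters reshape the features of every head.
t_mixed = np.eye(head_dim) + 0.1 * np.ones((head_dim, head_dim))
kha_mixed = knocking_heads_attention(q, k, v, t_mixed, t_mixed, t_mixed)
```

Under these assumptions, `kha_out` matches `baseline` at initialization, which is consistent with the claim that diagonal initialization preserves head-specific behavior at the start of training. The added parameters number `3 * head_dim**2` regardless of the head count, which fits the abstract's description of minimal overhead.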
