Mixture-of-Depths Attention

Abstract

Scaling depth is a key driver for large language models (LLMs). Yet as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to both the sequence KV pairs at the current layer and the depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns and achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, at a computational overhead of only 3.7% in FLOPs. We also find that combining MoDA with post-norm yields better performance than combining it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
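
As a rough illustration of the mechanism described above, the following minimal sketch computes one attention head for a single query token using one softmax over two sets of KV pairs: the sequence KV pairs of the current layer and the depth KV pairs from preceding layers. The function name `moda_head`, the joint-softmax formulation, and the toy inputs are assumptions made for illustration only; the abstract does not specify how the two sets are combined, and this is not the hardware-efficient algorithm from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moda_head(query, seq_keys, seq_values, depth_keys, depth_values):
    """Hypothetical single-head, single-query sketch.

    Assumption: sequence KV pairs (current layer) and depth KV pairs
    (preceding layers) are concatenated and share one softmax.
    """
    keys = seq_keys + depth_keys
    values = seq_values + depth_values
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


# Toy inputs (illustrative values, not from the paper).
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]   # keys of visible tokens at the current layer
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[1.0, 0.0]]             # a key carried over from a preceding layer
depth_v = [[0.5, 0.5]]

out, w = moda_head(q, seq_k, seq_v, depth_k, depth_v)
base_out, base_w = moda_head(q, seq_k, seq_v, [], [])  # no depth KV: ordinary attention
```

With empty depth lists the sketch reduces to ordinary scaled dot-product attention for one query, so under this assumption the depth KV pairs act purely as extra attention targets.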