注意残差 - 論文詳細

要旨

現代の大規模言語モデルでは、PreNormを伴う残差接続が標準的であるが、これらは全ての層の出力を固定の単位重みで累積する。この均一な集約は制御不能な隠れ状態の成長を深さとともに引き起こし、各層の寄与を次第に希薄化させる。我々はAttention Residuals（AttnRes）を提案する。これは固定された累積を、先行する層の出力に対するソフトマックス注意に置き換え、各層が学習された入力依存の重みを用いて先行する表現を選択的に集約することを可能にする。大規模モデル訓練において全ての先行層の出力に注意を向ける際のメモリと通信のオーバーヘッドに対処するため、層をブロックに分割しブロックレベルの表現に注意を向けるBlock AttnResを導入する。これによりメモリ使用量を削減しつつ、完全なAttnResの利点の大部分を保持する。キャッシュベースのパイプライン通信と2段階計算戦略と組み合わせることで、Block AttnResは最小限のオーバーヘッドで標準的な残差接続の実用的なドロップイン代替となる。スケーリング則実験により、改善効果がモデルサイズ間で一貫していることが確認され、アブレーション研究はコンテンツ依存の深さ方向選択の利点を検証した。さらにAttnResをKimi Linearアーキテクチャ（総パラメータ48B／活性化パラメータ3B）に統合し、1.4Tトークンで事前学習を実施した。その結果、AttnResはPreNormの希薄化を緩和し、深度全体でより均一な出力大きさと勾配分布をもたらし、評価した全ての下流タスクにおいて性能向上が確認された。

English

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.