Delta Attention Residuals

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and close to uniform (max weight ≈0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: which layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and per-block granularity. Across all tested scales (220M to 7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, improving validation perplexity by 1.7% to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
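
The contrast between routing over cumulative states and routing over deltas can be illustrated with a small sketch. It is not the authors' implementation: the helper names (deltas, rms_normalize, softmax, route), the RMS-normalized keys, the single fixed query vector, and the hand-picked toy hidden states are all assumptions made for illustration. The only element taken from the abstract is the delta definition v_i = h_{i+1} - h_i.

```python
import math


def deltas(hidden_states):
    """Return v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [
        [b - a for a, b in zip(h_i, h_next)]
        for h_i, h_next in zip(hidden_states, hidden_states[1:])
    ]


def rms_normalize(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (an assumed key normalization)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(query, sources):
    """Attend over `sources` with one query; return (weights, mixed vector)."""
    scores = [
        sum(q * k for q, k in zip(query, rms_normalize(src))) for src in sources
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * src[d] for w, src in zip(weights, sources))
        for d in range(len(query))
    ]
    return weights, mixed


# Toy hidden states (hypothetical): a large shared component plus one small,
# distinct update per sublayer.
hidden = [
    [4.0, 4.0, 4.0, 4.0],  # h_0
    [5.0, 4.0, 4.0, 4.0],  # h_1
    [5.0, 5.0, 4.0, 4.0],  # h_2
    [5.0, 5.0, 5.0, 4.0],  # h_3
]
query = [2.0, 0.0, 0.0, 0.0]  # hypothetical stand-in for a learned query

state_weights, _ = route(query, hidden[1:])      # attend over cumulative states
delta_weights, _ = route(query, deltas(hidden))  # attend over deltas

print("max weight over states:", max(state_weights))
print("max weight over deltas:", max(delta_weights))
```

On this toy input the cumulative states share a large common component, so their normalized keys are nearly parallel and the softmax weights stay close to uniform, while the deltas point in distinct directions and one of them receives most of the weight. The input was constructed to show the effect, so the printed values illustrate the mechanism only and do not correspond to the max weights of ≈0.2 and ≈0.6 reported above. Note also that h_0 plus the unweighted sum of all deltas reproduces the final state, which is the standard additive residual.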