Delta Attention Residuals

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over the outputs of previous layers, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and close to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: which layer-wise representations should be routed in Attention Residuals? To answer it, we propose Delta Attention Residuals, which attend over deltas, i.e., the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. The principle applies at both per-sublayer and block granularity. Across all tested scales (220M to 7.6B parameters), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, improving validation perplexity by 1.7% to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
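
To make the routing idea concrete, here is a minimal pure-Python sketch of softmax attention over deltas v_i = h_{i+1} - h_i. It is an illustration under stated assumptions, not the implementation in the linked repository: the helper names (`deltas`, `route`), the scaled dot-product scoring, the query vector `q`, and the toy hidden states are all hypothetical. The paper's actual parameterization, normalization, and granularity may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deltas(states):
    """Compute v_i = h_{i+1} - h_i for consecutive hidden states."""
    return [[b - a for a, b in zip(h, h_next)]
            for h, h_next in zip(states, states[1:])]

def route(query, candidates):
    """Softmax attention of a single query over candidate vectors.

    Returns (weights, mixed) with mixed = sum_i weights[i] * candidates[i].
    Scaled dot-product scoring is an assumption of this sketch.
    """
    d = len(query)
    scores = [sum(q * c for q, c in zip(query, cand)) / math.sqrt(d)
              for cand in candidates]
    weights = softmax(scores)
    mixed = [sum(w * cand[j] for w, cand in zip(weights, candidates))
             for j in range(d)]
    return weights, mixed

# Toy cumulative hidden states h_0..h_3 (made-up values, not from the paper).
h = [[1.0, 1.0, 1.0, 1.0],
     [1.5, 1.0, 1.0, 1.0],
     [1.5, 0.5, 1.0, 1.0],
     [1.5, 0.5, 2.0, 1.0]]

v = deltas(h)              # one delta per sublayer
q = [0.0, 0.0, 2.0, 0.0]   # hypothetical learned query for the current layer

weights, routed = route(q, v)

# A standard additive residual is the special case where every delta
# gets weight 1: h_0 plus the sum of all deltas recovers the last state.
additive = [h[0][j] + sum(vi[j] for vi in v) for j in range(len(h[0]))]
```

In this sketch, the additive residual stream corresponds to fixed unit weights on every delta, whereas `route` lets a layer weight each sublayer's contribution by its relevance to the query. The same `route` function can be applied to the cumulative states `h` instead of `v` to inspect how the weight distribution changes.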