
Attention Residuals

March 16, 2026
Authors: Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou
cs.AI

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
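The core idea — replacing the fixed unit-weight residual sum with a softmax attention over cached preceding layer outputs — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the query/key projections (`Wq`, `Wk`), the toy `layer_fn`, and the scaling are all hypothetical stand-ins for whatever parameterization the authors actually use.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8          # hidden size
n_layers = 4   # depth

# Hypothetical per-layer query/key projections (illustrative, not from the paper).
Wq = rng.normal(size=(n_layers, d, d)) / np.sqrt(d)
Wk = rng.normal(size=(n_layers, d, d)) / np.sqrt(d)

def layer_fn(l, x):
    # Stand-in for the l-th transformer sublayer (attention/MLP with PreNorm).
    return np.tanh(x + 0.1 * l)

x = rng.normal(size=d)
outputs = [x]                  # cache of the embedding plus all layer outputs so far
h = x
for l in range(n_layers):
    o = layer_fn(l, h)
    outputs.append(o)
    cache = np.stack(outputs)              # (l + 2, d) preceding representations
    q = Wq[l] @ o                          # input-dependent query from the new output
    keys = cache @ Wk[l].T                 # keys from each cached output
    w = softmax(keys @ q / np.sqrt(d))     # learned, depth-wise selection weights
    h = w @ cache                          # weighted aggregation replaces h = h + o

print(h.shape)  # (8,)
```

With the standard residual stream, the last line would be `h = h + o`, i.e. fixed unit weights over all layers; here each layer instead learns where along the depth axis to draw its input from. Block AttnRes would shrink `cache` to a handful of block-level representations rather than one entry per layer.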