Delta Attention Residuals

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and close to uniform (max weight about 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: which layer-wise representations should be routed in Attention Residuals? To answer it, we propose Delta Attention Residuals, which attend over deltas, that is, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight about 0.6), enabling more selective and effective routing across layers. The principle applies at both per-sublayer and block granularity. Across all tested scales (220M to 7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, improving validation perplexity by 1.7 to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
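
To make the distinction between routing over cumulative states and routing over deltas concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how attention scores are computed, so the scaled dot-product scoring, the single query vector `query`, and the helper names `softmax`, `deltas`, and `route` are all assumptions made for illustration. The toy hidden states are invented and chosen so that the cumulative states share a common component; the resulting weights illustrate the idea only and do not reproduce the reported values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deltas(hidden_states):
    """Compute v_i = h_{i+1} - h_i for consecutive hidden states."""
    return [
        [b - a for a, b in zip(h_prev, h_next)]
        for h_prev, h_next in zip(hidden_states, hidden_states[1:])
    ]


def route(query, candidates):
    """Softmax-attend over candidate vectors.

    Scaled dot-product scoring is an assumption of this sketch;
    the source only states that learned softmax attention is used.
    Returns (weights, mixed_vector).
    """
    dim = len(query)
    scores = [
        sum(q * c for q, c in zip(query, cand)) / math.sqrt(dim)
        for cand in candidates
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * cand[j] for w, cand in zip(weights, candidates))
        for j in range(dim)
    ]
    return weights, mixed


# Invented toy hidden states h_0..h_3 (hypothetical values).
hidden = [
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, 2.0, -1.0],
    [4.0, 2.0, -1.0],
]
query = [1.0, 0.0, 0.0]  # hypothetical stand-in for a learned query

v = deltas(hidden)                              # what each step changed
w_cum, _ = route(query, hidden[:-1])            # attend over h_0..h_2
w_delta, mixed_delta = route(query, v)          # attend over v_0..v_2

print("weights over cumulative states:", w_cum)
print("weights over deltas:           ", w_delta)
```

In this toy setup the three cumulative states all carry the same component along the query direction, so they receive identical scores and the weights are uniform, whereas the deltas differ along that direction and one of them receives most of the weight. A real model would learn the query and operate on high-dimensional tensors, typically per token and per layer.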