Attention Residuals
March 16, 2026
Authors: Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou
cs.AI
Abstract
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.
Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distributions across depth, and improves downstream performance across all evaluated tasks.
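The core idea can be sketched in a few lines: where a standard PreNorm residual stream sums all preceding layer outputs with fixed unit weights, AttnRes replaces that sum with a softmax-weighted aggregation whose weights depend on the input. The sketch below is a minimal illustration under assumed parameterizations (the query derived from the current layer's input and one key per preceding layer output are assumptions; the paper's exact query/key construction is not specified in this abstract).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(layer_outputs, query, keys):
    """Aggregate preceding layer outputs with input-dependent softmax
    weights, instead of the fixed unit-weight sum of a standard residual.

    layer_outputs: (L, d) stacked outputs of the L preceding layers
    query:         (d,)   derived from the current layer's input (assumption)
    keys:          (L, d) one learned key per preceding output (assumption)
    """
    scores = keys @ query / np.sqrt(len(query))  # (L,) scaled dot products
    weights = softmax(scores)                    # input-dependent depth weights
    return weights @ layer_outputs               # (d,) selective aggregation
```

A standard residual connection corresponds to fixing all weights to one; here the weights are learned and input-dependent, so each layer can emphasize or suppress particular depths. Block AttnRes would apply the same attention over block-level representations (e.g., pooled groups of layers) rather than every individual layer output, shrinking the `L` dimension and hence the memory footprint.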