
When Does Sparsity Mitigate the Curse of Depth in LLMs

March 16, 2026
作者: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu
cs.AI

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance under Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
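The variance-accumulation mechanism the abstract refers to can be illustrated with a minimal NumPy sketch (not taken from the paper's code): in a Pre-LN residual stream, each block adds a roughly unit-variance update to an ever-growing residual, so the residual's variance grows with depth while each new block's relative contribution shrinks toward zero, i.e. deep blocks drift toward near-identity behavior. The random linear map `W` below is a hypothetical stand-in for an attention/MLP block.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 512, 48  # hidden width and number of blocks (illustrative values)

def layer_norm(x):
    # Simplified LayerNorm: zero-mean, unit-variance over the feature dim.
    return (x - x.mean()) / x.std()

x = rng.normal(size=d)  # residual stream at the input
for layer in range(1, depth + 1):
    # Pre-LN block: the sublayer acts on the *normalized* input,
    # and its output is added back to the unnormalized residual.
    W = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for attention/MLP
    update = W @ layer_norm(x)                # roughly unit-variance update
    x = x + update
    if layer % 12 == 0:
        # Relative contribution of this block to the residual stream.
        ratio = np.linalg.norm(update) / np.linalg.norm(x)
        print(f"layer {layer:2d}: var(x) = {x.var():6.2f}  |update|/|x| = {ratio:.3f}")
```

Because the ~unit-variance updates accumulate roughly independently, `var(x)` grows approximately linearly with depth, and the per-block contribution ratio decays like `1/sqrt(depth)`; mechanisms that reduce update variance (as the paper argues sparsity does) slow this decay and keep deep blocks effective.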
PDF · March 18, 2026