
Progressive Residual Warmup for Language Model Pretraining

March 5, 2026
Authors: Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang
cs.AI

Abstract

Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency among sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layers learn first" philosophy by multiplying each layer's residual branch by a scalar that gradually warms up from 0 to 1, with deeper layers assigned longer warmup schedules. In this way, deeper layers wait for earlier layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, normalization schemes, and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also induces a distinctive optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
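
For concreteness, below is a minimal PyTorch sketch of the mechanism the abstract describes: each layer's residual branch is scaled by a coefficient that warms up from 0 to 1, with deeper layers given longer warmup windows. The class name `ProResBlock`, the linear depth-proportional schedule, and the per-forward step counter are illustrative assumptions, not the paper's exact recipe; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn


class ProResBlock(nn.Module):
    """Wraps a Transformer sublayer and scales its residual branch by a
    warmup coefficient that ramps from 0 to 1, with deeper layers given
    proportionally longer warmup windows (hypothetical schedule)."""

    def __init__(self, block: nn.Module, layer_idx: int, base_warmup_steps: int = 1000):
        super().__init__()
        self.block = block
        # Assumption: warmup length grows linearly with depth, so layer k
        # warms up over (k + 1) * base_warmup_steps training steps.
        self.warmup_steps = base_warmup_steps * (layer_idx + 1)
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scalar coefficient in [0, 1]; reaches 1 once warmup completes.
        alpha = torch.clamp(self.step.float() / self.warmup_steps, max=1.0)
        if self.training:
            self.step += 1  # counts training forward passes, a simplification
        # Only the residual branch is scaled; the skip path is untouched,
        # so early in training deep layers behave close to the identity.
        return x + alpha * self.block(x)


# Usage sketch: wrap each layer of a toy stack with depth-dependent warmup.
layers = nn.ModuleList(
    ProResBlock(nn.Sequential(nn.Linear(64, 64), nn.GELU()), layer_idx=i)
    for i in range(4)
)
```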