
Progressive Residual Warmup for Language Model Pretraining

March 5, 2026
Authors: Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang
cs.AI

Abstract

Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency among sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layers learn first" philosophy by multiplying each layer's residual branch by a scalar that gradually warms up from 0 to 1, with deeper layers warming up over more steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, normalization methods, and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also induces a distinctive optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
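
For concreteness, the mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the linear ramp, the depth-proportional warmup length, and the names `prores_scale`, `ProResBlock`, and `base_warmup` are all assumptions here; the abstract only specifies a scalar gate that grows from 0 to 1, with a warmup horizon that increases with layer depth.

```python
import torch
import torch.nn as nn


def prores_scale(step: int, layer_idx: int, num_layers: int,
                 base_warmup: int = 1000) -> float:
    """Gate value for layer `layer_idx` at training step `step`.

    Ramps linearly from 0 to 1; deeper layers get proportionally
    longer warmup horizons (the linear-in-depth scaling and the
    `base_warmup` default are assumptions, not taken from the paper).
    """
    warmup_steps = base_warmup * (layer_idx + 1) / num_layers
    return min(1.0, step / warmup_steps)


class ProResBlock(nn.Module):
    """Wraps a residual sub-layer so its branch output is gated."""

    def __init__(self, sublayer: nn.Module, layer_idx: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer      # e.g. an attention or MLP sub-layer
        self.layer_idx = layer_idx
        self.num_layers = num_layers
        # Buffer so the current step survives checkpointing / .to(device).
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = prores_scale(int(self.step), self.layer_idx, self.num_layers)
        # Residual connection with a warmed-up branch: y = x + alpha * f(x).
        # At alpha = 0 the layer is an identity map; at alpha = 1 it is a
        # standard residual block.
        return x + alpha * self.sublayer(x)
```

In this sketch, the training loop would advance each wrapper's gate once per optimizer update (e.g. `block.step += 1`); at `alpha = 0` a wrapped layer is an identity map, so early in training the network behaves like a shallower model whose effective depth grows as successive gates open.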