Learning Dynamics in Continual Pre-Training for Large Language Models
May 12, 2025
Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
cs.AI
Abstract
Continual Pre-Training (CPT) has become a popular and effective method to
apply strong foundation models to specific downstream tasks. In this work, we
explore the learning dynamics throughout the CPT process for large language
models. We specifically focus on how general and downstream domain performance
evolves at each training step, with domain performance measured via validation
losses. We observe that the CPT loss curve fundamentally characterizes the
transition from one curve to another hidden curve, and can be described by
decoupling the effects of distribution shift and learning rate annealing. We
derive a CPT scaling law that combines the two factors, enabling the prediction
of loss at any (continual) training step and across learning rate schedules
(LRS) in CPT. Our formulation presents a comprehensive understanding of several
critical factors in CPT, including loss potential, peak learning rate, training
steps, replay ratio, etc. Moreover, our approach can be adapted to different
CPT goals, such as balancing general and domain-specific performance, by
customizing training hyper-parameters. Extensive experiments demonstrate that
our scaling law holds across various CPT datasets and training hyper-parameters.
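The abstract does not state the closed form of the scaling law. As a rough illustration of what step-wise loss prediction under a given learning rate schedule can look like, the sketch below combines a power-law term in the training step, a learning-rate-annealing term, and a constant distribution-shift offset. The functional form, the parameter names (L0, A, alpha, C, B_shift), the cosine schedule, and all numeric values are illustrative assumptions, not the formula derived in the paper.

```python
# A minimal, purely illustrative sketch of step-wise loss prediction under a
# given LR schedule. The functional form and all parameters are assumptions
# for illustration only -- not the paper's derived CPT scaling law.
import numpy as np

def cosine_lrs(total_steps, peak_lr=3e-5, final_lr=3e-6):
    """Cosine learning-rate schedule decaying from peak_lr to final_lr."""
    t = np.arange(total_steps) / max(1, total_steps - 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + np.cos(np.pi * t))

def predict_loss(lrs, L0=1.8, A=0.5, alpha=0.4, C=2.0, B_shift=0.15):
    """Toy per-step validation-loss curve combining three assumed effects:
      - a power-law decay in the training step,
      - an annealing bonus proportional to the decayed-LR area S2,
      - a constant distribution-shift offset B_shift for the new domain.
    """
    steps = np.arange(1, len(lrs) + 1, dtype=float)
    S2 = np.cumsum(lrs.max() - lrs)   # crude proxy for the LR-annealing area
    return L0 + B_shift + A * steps ** (-alpha) - C * S2

lrs = cosine_lrs(total_steps=10_000)
losses = predict_loss(lrs)
print(f"predicted domain loss at step 1:     {losses[0]:.3f}")
print(f"predicted domain loss at final step: {losses[-1]:.3f}")
```

In such a formulation, the schedule enters only through the annealing term, while the distribution-shift offset captures the gap between the original pre-training curve and the hidden curve of the new domain; the actual decomposition used by the paper may differ.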