
Learning Dynamics in Continual Pre-Training for Large Language Models

May 12, 2025
Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
cs.AI

Abstract

Continual Pre-Training (CPT) has become a popular and effective method for applying strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We observe that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation provides a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
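
To make the idea of step-wise loss prediction concrete, the sketch below shows one plausible way such a scaling law could be fitted in practice. It is a minimal toy illustration, not the paper's actual formulation: the parametric form, the parameter names (L_base, A, alpha, C), the pre-training offset, and the toy learning-rate schedule are all assumptions chosen for demonstration. It decouples two quantities in the spirit of the abstract: a cumulative learning-rate "forward area" term (progress under the shifted data distribution) and an "annealing area" term (the benefit of decaying the learning rate), and fits the parameters with scipy.optimize.curve_fit.

```python
# Hypothetical sketch of fitting a CPT-style loss predictor (illustration only;
# not the paper's scaling law). Parameter names and functional form are assumed.
import numpy as np
from scipy.optimize import curve_fit

def cpt_loss(inputs, L_base, A, alpha, C):
    """Toy loss model: power-law decay in the forward LR area minus an annealing bonus.

    s1: cumulative sum of learning rates up to each step ("forward area")
    s2: cumulative annealed LR area (how much the LR has been decayed so far)
    """
    s1, s2 = inputs
    return L_base + A * np.power(s1, -alpha) - C * s2

# Toy CPT learning-rate schedule: constant phase followed by linear decay to zero.
steps = np.arange(1, 2001)
lr = np.where(steps <= 1000, 3e-4, 3e-4 * (2000 - steps) / 1000)

# Offset of 1.0 stands in for the forward area already accumulated during
# pre-training (an arbitrary placeholder value).
s1 = 1.0 + np.cumsum(lr)
s2 = np.cumsum(np.maximum(lr[0] - lr, 0.0))

# Synthetic "observed" validation losses generated from the same toy form plus noise.
rng = np.random.default_rng(0)
true_params = (2.0, 0.8, 0.45, 2.0)
observed = cpt_loss((s1, s2), *true_params) + rng.normal(0.0, 0.005, steps.size)

# Fit the toy scaling-law parameters from the observed loss curve.
popt, _ = curve_fit(cpt_loss, (s1, s2), observed, p0=(1.0, 1.0, 0.5, 1.0), maxfev=20000)
print("fitted (L_base, A, alpha, C):", np.round(popt, 3))

# Once fitted, the same function can be evaluated under a different LR schedule
# (different s1/s2 trajectories) to predict the loss curve before running it.
```

The point of the sketch is the workflow, not the exact equation: fit a small number of parameters from an observed CPT loss curve, then reuse them to predict losses at arbitrary continual-training steps and under alternative learning-rate schedules, as the abstract describes.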

