Learning Dynamics in Continual Pre-Training for Large Language Models

Abstract

Continual Pre-Training (CPT) has become a popular and effective method for adapting strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We focus specifically on how general and downstream domain performance evolve at each training step, with domain performance measured via validation loss. We observe that the CPT loss curve fundamentally characterizes the transition from one curve to another, hidden curve, and that it can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines these two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation provides a comprehensive understanding of several critical factors in CPT, such as loss potential, peak learning rate, training steps, and replay ratio. Moreover, our approach can be adapted to customize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
