Simple and Scalable Strategies to Continually Pre-train Large Language Models
March 13, 2024
Authors: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
cs.AI
Abstract
Large language models (LLMs) are routinely pre-trained on billions of tokens,
only to start the process over again once new data becomes available. A much
more efficient solution is to continually pre-train these models, saving
significant compute compared to re-training. However, the distribution shift
induced by new data typically results in degraded performance on previous data
or poor adaptation to the new data. In this work, we show that a simple and
scalable combination of learning rate (LR) re-warming, LR re-decaying, and
replay of previous data is sufficient to match the performance of fully
re-training from scratch on all available data, as measured by final loss and
language model (LM) evaluation benchmarks. Specifically, we show this for a
weak but realistic distribution shift between two commonly used LLM
pre-training datasets (English → English) and a stronger distribution
shift (English → German) at the 405M parameter model scale with
large dataset sizes (hundreds of billions of tokens). Selecting the weak but
realistic shift for larger-scale experiments, we also find that our continual
learning strategies match the re-training baseline for a 10B parameter LLM. Our
results demonstrate that LLMs can be successfully updated via simple and
scalable continual learning strategies, matching the re-training baseline using
only a fraction of the compute. Finally, inspired by previous work, we propose
alternatives to the cosine learning rate schedule that help circumvent
forgetting induced by LR re-warming and that are not bound to a fixed token
budget.
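For readers who want a concrete picture of the recipe the abstract refers to, the snippet below is a minimal, self-contained sketch (not the authors' code) of the three ingredients: re-warming the learning rate, re-decaying it with a cosine schedule over the new data, and mixing a small fraction of replayed previous data into each batch. The function names, warmup length, learning-rate values, and replay fraction are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative sketch of continual pre-training ingredients:
# LR re-warming, cosine LR re-decay, and replay of previous data.
# All names and hyperparameter values below are assumptions.
import math
import random

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Linearly re-warm the LR from min_lr up to max_lr, then cosine re-decay
    back to min_lr over the remaining steps of the continual-training stage."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def sample_batch(new_data, old_data, batch_size, replay_fraction=0.05):
    """Build a training batch from the new dataset, replaying a small fraction
    of examples from the previous dataset to mitigate forgetting."""
    n_replay = int(batch_size * replay_fraction)
    batch = random.sample(new_data, batch_size - n_replay)
    batch += random.sample(old_data, n_replay)
    random.shuffle(batch)
    return batch
```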