Simple and Scalable Strategies to Continually Pre-train Large Language Models
March 13, 2024
作者: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
cs.AI
Abstract
Large language models (LLMs) are routinely pre-trained on billions of tokens,
only to start the process over again once new data becomes available. A much
more efficient solution is to continually pre-train these models, saving
significant compute compared to re-training. However, the distribution shift
induced by new data typically results in degraded performance on previous data
or poor adaptation to the new data. In this work, we show that a simple and
scalable combination of learning rate (LR) re-warming, LR re-decaying, and
replay of previous data is sufficient to match the performance of fully
re-training from scratch on all available data, as measured by final loss and
language model (LM) evaluation benchmarks. Specifically, we show this for a
weak but realistic distribution shift between two commonly used LLM
pre-training datasets (English→English) and a stronger distribution
shift (English→German) at the 405M parameter model scale with
large dataset sizes (hundreds of billions of tokens). Selecting the weak but
realistic shift for larger-scale experiments, we also find that our continual
learning strategies match the re-training baseline for a 10B parameter LLM. Our
results demonstrate that LLMs can be successfully updated via simple and
scalable continual learning strategies, matching the re-training baseline using
only a fraction of the compute. Finally, inspired by previous work, we propose
alternatives to the cosine learning rate schedule that help circumvent
forgetting induced by LR re-warming and that are not bound to a fixed token
budget.
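To make the recipe named in the abstract concrete, the following is a minimal sketch, in plain Python, of its three ingredients: LR re-warming, LR re-decaying (here with a cosine schedule), and replay of previous data, plus one possible constant-LR alternative that is not bound to a fixed token budget. The function names (`rewarmed_cosine_lr`, `infinite_lr`, `mixed_batch`) and hyperparameters (e.g., the 5% replay fraction) are illustrative assumptions, not the paper's exact implementation.

```python
import math
import random

def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """LR re-warming followed by cosine re-decay for a continual pre-training run."""
    if step < warmup_steps:
        # Re-warming: linearly raise the LR from min_lr back up to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine-anneal from max_lr back down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def infinite_lr(step, warmup_steps, const_lr):
    """One alternative to cosine decay: warm up once, then hold a constant LR,
    so the schedule need not commit to a total token budget in advance
    (illustrative sketch only)."""
    if step < warmup_steps:
        return const_lr * step / warmup_steps
    return const_lr

def mixed_batch(new_data, old_data, batch_size, replay_frac=0.05):
    """Replay: mix a fraction of previous-distribution examples into each batch
    drawn from the new dataset (the 5% default here is a hypothetical setting)."""
    n_old = int(batch_size * replay_frac)
    batch = random.sample(old_data, n_old) + random.sample(new_data, batch_size - n_old)
    random.shuffle(batch)
    return batch
```

In a training loop, each optimizer step would set its LR from one of these schedules and draw its tokens via `mixed_batch`. The appeal of the constant-LR variant is that a later continual pre-training phase can resume directly at `const_lr`, avoiding the re-warming spike that, per the abstract, induces forgetting under cosine schedules.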