Efficient Continual Pre-training by Mitigating the Stability Gap

June 21, 2024
Authors: Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
cs.AI

Abstract

Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a properly sized subset for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
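
The three strategies above are essentially a data-curation recipe. The following is a minimal sketch of how such a recipe could be wired up, not the authors' actual pipeline: the quality_score field, the 0.25 subset ratio, the 0.7 quality threshold, the 20% general-data mixture, and the 5-epoch count are illustrative assumptions, not values taken from the paper.

```python
import random


def build_continual_pretraining_stream(domain_docs, general_docs,
                                        subset_ratio=0.25,
                                        quality_threshold=0.7,
                                        general_mix=0.2,
                                        epochs=5,
                                        seed=0):
    """Yield document texts for continual pre-training.

    Strategy 1: repeat a properly sized subset over multiple epochs
                instead of one pass over the full domain corpus.
    Strategy 2: keep only high-quality domain documents.
    Strategy 3: mix in data resembling the original pre-training
                distribution to shrink the distribution gap.

    Each document is assumed to be a dict with at least a "text" field;
    domain documents also carry an illustrative "quality_score".
    """
    rng = random.Random(seed)

    # Strategy 2: filter the domain corpus by a quality score
    # (e.g. from a classifier or perplexity filter -- an assumption here).
    high_quality = [d for d in domain_docs if d["quality_score"] >= quality_threshold]
    if not high_quality:
        raise ValueError("no domain documents passed the quality filter")

    # Strategy 1: draw a moderately sized subset to be repeated over epochs.
    subset_size = max(1, int(len(high_quality) * subset_ratio))
    subset = rng.sample(high_quality, subset_size)

    # Strategy 3: add general-domain documents so they make up
    # roughly `general_mix` of the final mixture.
    n_general = int(len(subset) * general_mix / (1.0 - general_mix))
    mixture = subset + rng.sample(general_docs, min(n_general, len(general_docs)))

    # Repeat the mixture for several epochs, reshuffling each time.
    for _ in range(epochs):
        rng.shuffle(mixture)
        for doc in mixture:
            yield doc["text"]
```

The resulting stream would then be tokenized and fed to whatever training loop is in use. The key design choice, per the abstract, is repeating a smaller high-quality, mixed corpus over several epochs rather than sweeping the full new-domain corpus once.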
