Efficient Continual Pre-training by Mitigating the Stability Gap

June 21, 2024
Authors: Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
cs.AI

Abstract

Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) continually pre-training the LLM on a subset of proper size for multiple epochs, which recovers performance faster than pre-training on the full large corpus for a single epoch; (2) pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
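The three strategies are, at heart, choices about how the continual pre-training corpus is assembled. Below is a minimal, illustrative Python sketch of such a pipeline; it is not the authors' released code, and the function name, quality threshold, subset size, and replay ratio are hypothetical values chosen only to make the idea concrete.

```python
# Illustrative sketch of corpus construction for continual pre-training,
# following the three strategies described in the abstract.
# All names and numbers below are assumptions, not the paper's settings.

import random

def build_continual_corpus(domain_docs, general_docs, quality_score,
                           subset_size=50_000, quality_threshold=0.8,
                           replay_ratio=0.2, seed=0):
    """Assemble a training corpus that (1) is capped at a proper size so it
    can be trained for multiple epochs, (2) keeps only high-quality domain
    documents, and (3) mixes in general, pre-training-like data to reduce
    the distribution gap."""
    rng = random.Random(seed)

    # Strategy 2: keep only high-quality domain documents,
    # as judged by some external quality scorer.
    high_quality = [d for d in domain_docs if quality_score(d) >= quality_threshold]

    # Strategy 1: cap the corpus at a proper size; this subset is then
    # trained for several epochs instead of one pass over a huge corpus.
    rng.shuffle(high_quality)
    domain_subset = high_quality[:subset_size]

    # Strategy 3: mix in data resembling the original pre-training
    # distribution (drawn here from a general corpus) to narrow the
    # distribution gap and limit forgetting.
    n_replay = int(len(domain_subset) * replay_ratio)
    replay = rng.sample(general_docs, min(n_replay, len(general_docs)))

    corpus = domain_subset + replay
    rng.shuffle(corpus)
    return corpus
```

In this sketch, the returned corpus would then be trained with a standard causal language modeling objective for multiple epochs (e.g., several passes over the capped subset rather than one pass over the full domain dump), which is where the faster recovery from the stability gap is expected to come from.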
