Instruction Pre-Training: Language Models are Supervised Multitask Learners
June 20, 2024
Authors: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
cs.AI
Abstract
Unsupervised multitask pre-training has been the critical method behind the
recent success of language models (LMs). However, supervised multitask learning
still holds significant promise, as scaling it in the post-training stage
trends towards better generalization. In this paper, we explore supervised
multitask pre-training by proposing Instruction Pre-Training, a framework that
scalably augments massive raw corpora with instruction-response pairs to
pre-train LMs. The instruction-response pairs are generated by an efficient
instruction synthesizer built on open-source models. In our experiments, we
synthesize 200M instruction-response pairs covering 40+ task categories to
verify the effectiveness of Instruction Pre-Training. In pre-training from
scratch, Instruction Pre-Training not only consistently enhances pre-trained
base models but also benefits more from further instruction tuning. In
continual pre-training, Instruction Pre-Training enables Llama3-8B to be
comparable to or even outperform Llama3-70B. Our model, code, and data are
available at https://github.com/microsoft/LMOps.
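The core idea of the framework can be illustrated with a minimal sketch: each raw document is passed through an instruction synthesizer, and the resulting instruction-response pairs are appended to the document to form an augmented pre-training example. The synthesizer below (`synthesize_pairs`) is a hypothetical stub standing in for the paper's fine-tuned open-source model; the function names and concatenation format are assumptions for illustration, not the paper's actual implementation.

```python
# Hedged sketch of instruction-augmented pre-training data, assuming a
# stub synthesizer. In the paper, the synthesizer is an LM fine-tuned
# to read a raw text and emit instruction-response pairs grounded in it.
from typing import Callable


def synthesize_pairs(raw_text: str) -> list[tuple[str, str]]:
    # Stub: a real synthesizer would generate diverse task pairs
    # (40+ task categories in the paper) conditioned on raw_text.
    return [("Summarize the passage.", raw_text[:50])]


def augment_corpus(
    raw_texts: list[str],
    synthesizer: Callable[[str], list[tuple[str, str]]],
) -> list[str]:
    """Turn raw texts into instruction-augmented pre-training examples."""
    examples = []
    for text in raw_texts:
        pairs = synthesizer(text)
        # Append synthesized pairs after the raw text; the exact
        # template is an assumption for this sketch.
        qa = "\n".join(f"Instruction: {i}\nResponse: {r}" for i, r in pairs)
        examples.append(f"{text}\n\n{qa}")
    return examples


corpus = ["Language models learn from large text corpora."]
augmented = augment_corpus(corpus, synthesize_pairs)
print(augmented[0])
```

The augmented examples are then used as ordinary pre-training text, so the same next-token objective implicitly supervises the model on the synthesized tasks.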