Instruction Pre-Training: Language Models are Supervised Multitask Learners

June 20, 2024
作者: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
cs.AI

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
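
The data flow described in the abstract is straightforward to sketch: each raw document is passed through the instruction synthesizer, and the resulting instruction-response pairs are appended to the document to form a single pre-training sequence. The snippet below is a minimal, hypothetical illustration of that flow; the `synthesize_pairs` helper, the prompt format, and the concatenation template are assumptions made for clarity and are not the released LMOps implementation.

```python
from typing import Dict, List


def synthesize_pairs(raw_text: str) -> List[Dict[str, str]]:
    """Placeholder for the instruction synthesizer: an open-source LM fine-tuned
    to turn a raw document into instruction-response pairs grounded in it.
    A real pipeline would call the synthesizer model here."""
    return [
        {"instruction": "What is the main topic of the passage?",
         "response": "..."},
    ]


def build_augmented_example(raw_text: str) -> str:
    """Concatenate the raw document with its synthesized pairs into one
    pre-training sequence, so the model still sees the original corpus but
    additionally receives supervised multitask signal."""
    pairs = synthesize_pairs(raw_text)
    qa_block = "\n\n".join(
        f"Instruction: {p['instruction']}\nResponse: {p['response']}"
        for p in pairs
    )
    return f"{raw_text}\n\n{qa_block}"


if __name__ == "__main__":
    corpus = ["Some raw document text ...", "Another raw document ..."]
    augmented = [build_augmented_example(doc) for doc in corpus]
    # `augmented` would then be tokenized and trained on with the standard
    # next-token objective, exactly like plain raw-corpus pre-training.
```

Because the augmented sequences are consumed with the ordinary next-token objective, the approach scales like standard pre-training while injecting supervised multitask signal into the corpus.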
