Instruction Pre-Training: Language Models are Supervised Multitask Learners

June 20, 2024
作者: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
cs.AI

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
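
The data flow described in the abstract is straightforward to sketch: each raw document is passed through the instruction synthesizer, and the resulting instruction-response pairs are appended to the document to form a single pre-training sequence. The snippet below is a minimal, hypothetical illustration of that flow; the `synthesize_pairs` helper, the prompt format, and the concatenation template are assumptions made for clarity and are not the released LMOps implementation.

```python
from typing import Dict, List


def synthesize_pairs(raw_text: str) -> List[Dict[str, str]]:
    """Placeholder for the instruction synthesizer: an open-source LM fine-tuned
    to turn a raw document into instruction-response pairs grounded in it.
    A real pipeline would call the synthesizer model here."""
    return [
        {"instruction": "What is the main topic of the passage?",
         "response": "..."},
    ]


def build_augmented_example(raw_text: str) -> str:
    """Concatenate the raw document with its synthesized pairs into one
    pre-training sequence, so the model still sees the original corpus but
    additionally receives supervised multitask signal."""
    pairs = synthesize_pairs(raw_text)
    qa_block = "\n\n".join(
        f"Instruction: {p['instruction']}\nResponse: {p['response']}"
        for p in pairs
    )
    return f"{raw_text}\n\n{qa_block}"


if __name__ == "__main__":
    corpus = ["Some raw document text ...", "Another raw document ..."]
    augmented = [build_augmented_example(doc) for doc in corpus]
    # `augmented` would then be tokenized and trained on with the standard
    # next-token objective, exactly like plain raw-corpus pre-training.
```

Because the augmented sequences are consumed with the ordinary next-token objective, the approach scales like standard pre-training while injecting supervised multitask signal into the corpus.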
