Instruction Pre-Training: Language Models are Supervised Multitask Learners

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
