MiniPLM: Knowledge Distillation for Pre-Training Language Models
October 22, 2024
Authors: Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
cs.AI
Abstract
Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces challenges in efficiency, flexibility, and effectiveness. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. To address these issues, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher's knowledge. For efficiency, MiniPLM performs offline teacher LM inference, allowing KD for multiple student LMs without adding training-time costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the difficulty and diversity of the training data, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 widely used downstream tasks, improves the language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to large pre-training scales, evidenced by the extrapolation of the scaling curves. Further analysis reveals that MiniPLM supports KD across model families and enhances the utilization of pre-training data. Our model, code, and data are available at https://github.com/thu-coai/MiniPLM.
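
The abstract describes the core idea only at a high level: teacher inference is run offline, and the gap between a large teacher LM and a small reference LM is used to refine which training documents the student sees. The sketch below illustrates one way such an offline, corpus-level refinement step could look. It is a minimal illustration, not the authors' implementation: the model checkpoints (`gpt2-large` / `gpt2`), the per-token average log-likelihood scoring, and the simple top-fraction selection rule are all assumptions made for the example; the actual MiniPLM recipe is defined in the paper and repository.

```python
# Hypothetical sketch: offline refinement of a pre-training corpus using the
# likelihood gap between a large "teacher" LM and a small reference LM.
# Documents the teacher finds much more plausible than the small LM are kept,
# which (per the abstract's intuition) favors harder, more diverse text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def avg_log_likelihood(model, tokenizer, text, device="cpu"):
    """Per-token average log-likelihood of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # loss is mean token NLL, so negate it


def refine_corpus(corpus, teacher, teacher_tok, ref, ref_tok, keep_ratio=0.5, device="cpu"):
    """Offline step: rank documents by the teacher-vs-reference log-likelihood
    gap and keep the top `keep_ratio` fraction as the refined corpus."""
    scores = []
    for doc in corpus:
        s_teacher = avg_log_likelihood(teacher, teacher_tok, doc, device)
        s_ref = avg_log_likelihood(ref, ref_tok, doc, device)
        scores.append(s_teacher - s_ref)  # large gap: teacher-plausible yet non-trivial
    ranked = sorted(zip(scores, corpus), key=lambda x: x[0], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return [doc for _, doc in ranked[:k]]


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Placeholder checkpoints: any large/small causal LM pair works for the sketch,
    # and they need not share a tokenizer with the student that is later pre-trained.
    teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").to(device).eval()
    teacher_tok = AutoTokenizer.from_pretrained("gpt2-large")
    ref = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
    ref_tok = AutoTokenizer.from_pretrained("gpt2")

    corpus = [
        "The derivative of x^2 is 2x, by the power rule.",
        "the the the the the the the the",
    ]
    refined = refine_corpus(corpus, teacher, teacher_tok, ref, ref_tok, keep_ratio=0.5, device=device)
    print(refined)  # the refined subset would then be used to pre-train the student LM
```

Because the refinement operates on the corpus rather than on teacher logits at training time, it needs to be run only once per corpus (amortized across multiple students) and places no tokenizer- or vocabulary-matching constraint on the student, which is consistent with the efficiency and cross-family flexibility claims in the abstract.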