MiniPLM: Knowledge Distillation for Pre-Training Language Models
October 22, 2024
Authors: Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
cs.AI
Abstract
Knowledge distillation (KD) is widely used to train small, high-performing
student language models (LMs) using large teacher LMs. While effective in
fine-tuning, KD during pre-training faces challenges in efficiency,
flexibility, and effectiveness. Existing methods either incur high
computational costs due to online teacher inference, require tokenization
matching between teacher and student LMs, or risk losing the difficulty and
diversity of the teacher-generated training data. To address these issues, we
propose MiniPLM, a KD framework for pre-training LMs by refining the training
data distribution with the teacher's knowledge. For efficiency, MiniPLM
performs offline teacher LM inference, allowing KD for multiple student LMs
without adding training-time costs. For flexibility, MiniPLM operates solely on
the training corpus, enabling KD across model families. For effectiveness,
MiniPLM leverages the differences between large and small LMs to enhance the
difficulty and diversity of the training data, helping student LMs acquire
versatile and sophisticated knowledge. Extensive experiments demonstrate that
MiniPLM boosts the student LMs' performance on 9 widely used downstream tasks,
improves the language modeling capabilities, and reduces pre-training
computation. The benefit of MiniPLM extends to large pre-training scales,
evidenced by the extrapolation of the scaling curves. Further analysis reveals
that MiniPLM supports KD across model families and enhances the utilization of
pre-training data. Our model, code, and data are available at
https://github.com/thu-coai/MiniPLM.
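The abstract only outlines the mechanism at a high level. As a rough illustration of "refining the training data distribution with the teacher's knowledge", the sketch below scores pre-training documents offline by the gap between a large teacher LM's and a small LM's (here called a reference LM) per-token log-likelihood, then keeps the highest-scoring subset for student pre-training. This is a minimal sketch under stated assumptions, not the released MiniPLM code: the model names, the single-document scoring loop, and the fixed keep ratio are illustrative placeholders (see the linked repository for the actual implementation).

```python
# Minimal sketch (not the official MiniPLM implementation) of offline,
# difference-based corpus refinement: score each pre-training document by how
# much more likely a large teacher LM finds it than a small reference LM,
# then keep the highest-scoring documents for student pre-training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def avg_log_prob(model, tokenizer, text, device):
    """Average per-token log-probability of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # HF causal LMs return the mean cross-entropy over tokens; negate it.
    return -out.loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
teacher_name = "large-teacher-lm"      # placeholder model identifier
reference_name = "small-reference-lm"  # placeholder model identifier

teacher = AutoModelForCausalLM.from_pretrained(teacher_name).to(device).eval()
reference = AutoModelForCausalLM.from_pretrained(reference_name).to(device).eval()
# Each model uses its own tokenizer, so the two LMs need not share a vocabulary;
# per-token averages make the scores roughly comparable (a simplification).
teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
reference_tok = AutoTokenizer.from_pretrained(reference_name)

corpus = ["document 1 ...", "document 2 ...", "document 3 ..."]  # toy corpus

# Offline step: run teacher/reference inference once and cache the scores.
scores = [
    avg_log_prob(teacher, teacher_tok, doc, device)
    - avg_log_prob(reference, reference_tok, doc, device)
    for doc in corpus
]

# Keep documents the teacher "prefers" relative to the small reference LM,
# which tend to be harder and more diverse than what the small LM captures.
keep_ratio = 0.5  # illustrative assumption
threshold = sorted(scores, reverse=True)[max(int(len(scores) * keep_ratio) - 1, 0)]
refined_corpus = [doc for doc, s in zip(corpus, scores) if s >= threshold]
print(refined_corpus)  # pre-train the student LM on this refined subset
```

Because the teacher and reference scores are computed once offline and cached, the same refined corpus can be reused to pre-train multiple student LMs without repeating teacher inference, which is the efficiency argument made in the abstract.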