Knowledge Distillation of Large Language Models

June 14, 2023
Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
cs.AI

Abstract

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods have mainly been applied to white-box classification models or to training small models that imitate black-box model APIs such as ChatGPT. How to effectively distill knowledge from white-box generative LLMs remains under-explored, and it becomes increasingly important as LLMs flourish. In this work, we propose MiniLLM, which distills smaller language models from larger generative language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in standard KD approaches with reverse KLD, which is more suitable for KD on generative language models because it prevents the student model from overestimating the low-probability regions of the teacher distribution. We then derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that MiniLLM models generate more precise responses with higher overall quality, lower exposure bias, better calibration, and better long-text generation performance. Our method also scales across model families with 120M to 13B parameters. We will release our code and model checkpoints at https://aka.ms/MiniLLM.
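
To make the contrast concrete, the sketch below compares the forward KLD used in standard KD with the reverse KLD that MiniLLM optimizes, computed token-wise from teacher and student logits. This is a minimal illustration under stated assumptions, not the authors' released implementation: the tensor names (`teacher_logits`, `student_logits`) and the toy shapes are placeholders, and the paper's full method further derives a dedicated optimization approach for sequence-level generation rather than minimizing reverse KLD directly on fixed logits.

```python
# Minimal sketch (not the MiniLLM codebase): forward vs. reverse KLD between
# a teacher and a student token distribution, given their logits.
import torch
import torch.nn.functional as F

def forward_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(p_teacher || q_student): the student is pushed to cover all of the
    teacher's support, including its low-probability regions."""
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(q_student || p_teacher): mode-seeking, so the student is penalized
    for placing probability mass where the teacher assigns little."""
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

# Toy usage with random logits over a vocabulary of size 8 and 4 positions.
teacher_logits = torch.randn(4, 8)                        # (seq_len, vocab)
student_logits = torch.randn(4, 8, requires_grad=True)    # trainable student
loss = reverse_kld(teacher_logits, student_logits)
loss.backward()
```

Because the expectation in reverse KLD is taken under the student distribution, the student concentrates on high-probability regions of the teacher rather than spreading mass to cover its entire support, which is the behavior the abstract describes for generative KD.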