On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
June 23, 2023
Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
cs.AI
Abstract
Knowledge distillation (KD) is widely used for compressing a teacher model to
reduce its inference cost and memory footprint, by training a smaller student
model. However, current KD methods for auto-regressive sequence models suffer
from distribution mismatch between output sequences seen during training and
those generated by the student during inference. To address this issue, we
introduce Generalized Knowledge Distillation (GKD). Instead of solely relying
on a fixed set of output sequences, GKD trains the student on its
self-generated output sequences by leveraging feedback from the teacher on such
sequences. Unlike supervised KD approaches, GKD also offers the flexibility to
employ alternative loss functions between the student and teacher, which can be
useful when the student lacks the expressivity to mimic the teacher's
distribution. Furthermore, GKD facilitates the seamless integration of
distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for
distilling auto-regressive language models on summarization, translation, and
arithmetic reasoning tasks, and task-agnostic distillation for
instruction-tuning.
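
To make the on-policy recipe described above concrete, here is a minimal PyTorch sketch of a single GKD-style update: the student samples its own output sequences, the frozen teacher scores those sequences token by token, and the student is updated to reduce a divergence between the two token-level distributions. The names (`sample_from_student`, `gkd_step`, `beta`) and the model interface (token ids in, per-token logits out) are illustrative assumptions, not the authors' implementation; the divergence shown is a generalized Jensen-Shannon loss, one example of the alternative losses the abstract mentions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_from_student(student, prompts, max_new_tokens=64):
    # On-policy data: ancestral sampling from the student, one token at a time.
    # `student` is assumed to map (batch, seq) token ids to (batch, seq, vocab) logits.
    ids = prompts
    for _ in range(max_new_tokens):
        next_logits = student(ids)[:, -1, :]                       # next-token logits
        next_ids = torch.multinomial(F.softmax(next_logits, -1), 1)
        ids = torch.cat([ids, next_ids], dim=-1)
    return ids


def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    # Token-level generalized Jensen-Shannon divergence, beta in (0, 1):
    # roughly interpolates (up to scaling) between forward KL (beta near 0)
    # and reverse KL (beta near 1).
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Mixture M = beta * teacher + (1 - beta) * student; plain log for readability,
    # a real implementation would use a numerically safer log-sum-exp.
    m_logp = (beta * t_logp.exp() + (1.0 - beta) * s_logp.exp()).log()
    kl_t_m = F.kl_div(m_logp, t_logp, log_target=True, reduction="none").sum(-1)
    kl_s_m = F.kl_div(m_logp, s_logp, log_target=True, reduction="none").sum(-1)
    return (beta * kl_t_m + (1.0 - beta) * kl_s_m).mean()


def gkd_step(student, teacher, prompts, optimizer, beta=0.5):
    # 1) Student generates its own output sequences (on-policy).
    sequences = sample_from_student(student, prompts)
    # 2) Teacher feedback: token-level distributions over those same sequences.
    with torch.no_grad():
        teacher_logits = teacher(sequences)
    # 3) Student distributions on its self-generated tokens.
    student_logits = student(sequences)
    # 4) Minimize a divergence between student and teacher on these tokens.
    #    (A real implementation would mask prompt and padding positions.)
    loss = generalized_jsd(student_logits, teacher_logits, beta=beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As the abstract notes, GKD also allows mixing such self-generated sequences with a fixed ground-truth dataset and combining the distillation loss with an RL fine-tuning objective; this sketch omits those pieces for brevity.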