On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
June 23, 2023
Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
cs.AI
Abstract
Knowledge distillation (KD) is widely used for compressing a teacher model to
reduce its inference cost and memory footprint, by training a smaller student
model. However, current KD methods for auto-regressive sequence models suffer
from distribution mismatch between output sequences seen during training and
those generated by the student during inference. To address this issue, we
introduce Generalized Knowledge Distillation (GKD). Instead of solely relying
on a fixed set of output sequences, GKD trains the student on its
self-generated output sequences by leveraging feedback from the teacher on such
sequences. Unlike supervised KD approaches, GKD also offers the flexibility to
employ alternative loss functions between the student and teacher, which can be
useful when the student lacks the expressivity to mimic the teacher's
distribution. Furthermore, GKD facilitates the seamless integration of
distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for
distilling auto-regressive language models on summarization, translation, and
arithmetic reasoning tasks, and task-agnostic distillation for
instruction-tuning.
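
To make the on-policy recipe concrete, below is a minimal PyTorch sketch of a single GKD training step: the student samples its own output sequences, the teacher provides token-level feedback on those sequences, and the student is updated with a divergence between the two token distributions (here a generalized Jensen-Shannon divergence). The model/tokenizer interfaces, the helper name, the hyperparameters, and the omission of prompt/padding masks are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal, illustrative sketch of one on-policy GKD update, assuming
# Hugging Face-style causal LMs (generate + forward pass returning .logits).
# Names, hyperparameters, and the omitted prompt/padding masking are
# assumptions for illustration, not the authors' implementation.
import math
import torch
import torch.nn.functional as F

def gkd_step(student, teacher, tokenizer, prompts, beta=0.5, max_new_tokens=128):
    # 1) Sample completions from the *student*, so training sees the same
    #    distribution of sequences the student will produce at inference time.
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        sequences = student.generate(**batch, do_sample=True,
                                     max_new_tokens=max_new_tokens)

    # 2) Re-score the sampled sequences under both models; only the student
    #    forward pass keeps gradients.
    student_logits = student(sequences).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, :-1]

    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher token distributions
    log_q = F.log_softmax(student_logits, dim=-1)  # student token distributions

    # 3) Generalized Jensen-Shannon divergence JSD(beta), beta in (0, 1):
    #    M = beta * P + (1 - beta) * Q,
    #    loss = beta * KL(P || M) + (1 - beta) * KL(Q || M).
    log_m = torch.logsumexp(
        torch.stack([log_p + math.log(beta), log_q + math.log(1.0 - beta)]),
        dim=0)
    kl_p_m = (log_p.exp() * (log_p - log_m)).sum(-1)
    kl_q_m = (log_q.exp() * (log_q - log_m)).sum(-1)
    per_token = beta * kl_p_m + (1.0 - beta) * kl_q_m

    # In practice, prompt and padding positions should be masked out here.
    loss = per_token.mean()
    loss.backward()
    return loss.item()
```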