On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Abstract

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from a distribution mismatch between the output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of relying solely on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and for task-agnostic distillation for instruction tuning.
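The idea of training on student-generated sequences with teacher feedback can be illustrated with a small sketch. This is not the authors' implementation: the toy models (`toy_teacher`, `toy_student`), the four-token vocabulary, and the helper names are all hypothetical, and forward and reverse KL are used here only as two examples of the "alternative loss functions" the abstract mentions. The sketch only evaluates the loss; a real training loop would also differentiate it with respect to the student's parameters.

```python
import math
import random

VOCAB = 4  # assumed toy vocabulary size, for illustration only


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher, student):
    """KL(teacher || student), the divergence typical of supervised KD."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def reverse_kl(teacher, student):
    """KL(student || teacher), one possible alternative divergence."""
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)


def on_policy_loss(student_fn, teacher_fn, prompt, length, divergence, rng):
    """Sample a continuation from the student and average the per-token
    divergence between teacher and student along that sampled sequence."""
    tokens = list(prompt)
    total = 0.0
    for _ in range(length):
        p_student = softmax(student_fn(tokens))
        p_teacher = softmax(teacher_fn(tokens))
        # Teacher feedback is computed on the student's own prefix.
        total += divergence(p_teacher, p_student)
        # The next token is drawn from the student, not from a fixed dataset.
        next_token = rng.choices(range(VOCAB), weights=p_student)[0]
        tokens.append(next_token)
    return total / length


def toy_teacher(tokens):
    """Hypothetical teacher: strongly prefers (last token + 1) mod VOCAB."""
    last = tokens[-1]
    return [3.0 if i == (last + 1) % VOCAB else 0.0 for i in range(VOCAB)]


def toy_student(tokens):
    """Hypothetical student: same preference as the teacher, but weaker."""
    last = tokens[-1]
    return [1.0 if i == (last + 1) % VOCAB else 0.0 for i in range(VOCAB)]


rng = random.Random(0)
fkl = on_policy_loss(toy_student, toy_teacher, [0], 8, forward_kl, rng)
rkl = on_policy_loss(toy_student, toy_teacher, [0], 8, reverse_kl, rng)
print(f"forward KL on student samples: {fkl:.4f}")
print(f"reverse KL on student samples: {rkl:.4f}")
```

Swapping the `divergence` argument is the only change needed to try a different student-teacher loss, which is the kind of flexibility the abstract describes.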
