Direct Preference Knowledge Distillation for Large Language Models
June 28, 2024
Authors: Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei
cs.AI
Abstract
In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in the distillation of LLMs, including efficiency and the insufficient measurement capability of traditional KL divergence. It is shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conduct experiments and analyses on various datasets with LLM parameters ranging from 120M to 13B, demonstrating the broad applicability and effectiveness of our DPKD approach. Meanwhile, we show the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline methods in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.
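As a rough sketch of the two-stage reformulation described above (illustrative notation, not necessarily the paper's exact formulation): let p denote the teacher distribution, q_θ the student, β a scaling coefficient, and σ the logistic function. The first stage optimizes an objective combining an implicit reward r(x, y) with a reverse KL term,

\[
\max_{q_\theta}\; \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big( q_\theta(\cdot \mid x) \,\|\, p(\cdot \mid x) \big),
\]

and the second stage raises the probability, under a Bradley-Terry-style preference model, that a teacher output y_t is preferred over a student output y_s:

\[
P(y_t \succ y_s \mid x) \;=\; \sigma\big( r(x, y_t) - r(x, y_s) \big).
\]

The implicit reward here is derived from the model distributions themselves rather than from a separately trained reward model; the precise definition and derivation are given in the paper.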