Direct Preference Knowledge Distillation for Large Language Models

June 28, 2024
Authors: Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei
cs.AI

Abstract

In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges when distilling LLMs, including efficiency and the insufficient measurement capability of the traditional KL divergence. We show that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and the implicit reward function. We reformulate KD of LLMs into two stages: first optimizing an objective consisting of the implicit reward and a reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conducted experiments and analysis on various datasets, with LLM parameters ranging from 120M to 13B, demonstrating the broad applicability and effectiveness of our DPKD approach. Meanwhile, we demonstrate the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.
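
To make the two-stage formulation above more concrete, here is a minimal sketch of what such an objective could look like. It is not the authors' released implementation (see https://aka.ms/dpkd for that): it assumes a DPO-style implicit reward r(x, y) = β(log p_student(y|x) − log p_ref(y|x)) and a Bradley-Terry preference loss that favors teacher outputs over student-sampled outputs, alongside a token-level reverse KL term; the function names, the reference model, and the β hyperparameter are illustrative assumptions rather than the paper's exact definitions.

```python
import torch
import torch.nn.functional as F


def reverse_kl(student_logits, teacher_logits):
    # Token-level reverse KL divergence D_KL(p_student || p_teacher),
    # the distillation term mentioned in the abstract's first stage.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1).mean()


def preference_loss(student_logp_teacher_out, student_logp_student_out,
                    ref_logp_teacher_out, ref_logp_student_out, beta=0.1):
    # Assumed DPO-style implicit reward:
    #   r(x, y) = beta * (log p_student(y|x) - log p_ref(y|x)).
    reward_teacher = beta * (student_logp_teacher_out - ref_logp_teacher_out)
    reward_student = beta * (student_logp_student_out - ref_logp_student_out)
    # Bradley-Terry objective: raise the probability that the teacher's
    # output is preferred over the student's own sampled output.
    return -F.logsigmoid(reward_teacher - reward_student).mean()


if __name__ == "__main__":
    # Toy usage with random sequence-level log-probabilities (batch of 4)
    # and random logits (batch 2, length 8, vocabulary 50).
    s_teacher, s_student = torch.randn(4), torch.randn(4)
    r_teacher, r_student = torch.randn(4), torch.randn(4)
    print(preference_loss(s_teacher, s_student, r_teacher, r_student))
    print(reverse_kl(torch.randn(2, 8, 50), torch.randn(2, 8, 50)))
```

In the paper's formulation these two pieces are combined into a single distillation objective over the two stages; the exact reward definition and weighting are given in the paper itself.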
