Direct Preference Knowledge Distillation for Large Language Models

Abstract

In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations when distilling LLMs, including low efficiency and the limited measurement capability of the traditional Kullback-Leibler (KL) divergence. It has been shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD uses distribution divergence to represent the preference loss and the implicit reward function. We reformulate KD of LLMs into two stages: first, optimizing an objective consisting of the implicit reward and the reverse KL divergence, and second, increasing the preference probability of teacher outputs over student outputs. We conduct experiments and analysis on various datasets with LLMs ranging from 120M to 13B parameters and demonstrate the broad applicability and effectiveness of the DPKD approach. We further demonstrate the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. DPKD outperforms the baseline methods in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.
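The two ingredients named in the abstract, reverse KL divergence and the preference of teacher outputs over student outputs, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes, by analogy with DPO-style objectives, that the implicit reward of an output is a scaled log-ratio between student and teacher sequence probabilities and that preferences follow a Bradley-Terry (logistic) model. The function names, argument names, and the `beta` parameter are all hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """Reverse KL, KL(student || teacher), for two categorical
    distributions over the same vocabulary (lists of probabilities)."""
    return sum(
        q * math.log(q / p)
        for q, p in zip(student_probs, teacher_probs)
        if q > 0.0  # terms with zero student mass contribute nothing
    )


def preference_loss(student_lp_on_teacher_out, teacher_lp_on_teacher_out,
                    student_lp_on_student_out, teacher_lp_on_student_out,
                    beta=1.0):
    """Negative log-probability that the teacher output is preferred
    over the student output, under a logistic preference model.

    Assumption (not taken from the abstract): the implicit reward of an
    output is beta * (student log-prob - teacher log-prob).
    """
    reward_teacher_out = beta * (student_lp_on_teacher_out - teacher_lp_on_teacher_out)
    reward_student_out = beta * (student_lp_on_student_out - teacher_lp_on_student_out)
    margin = reward_teacher_out - reward_student_out
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: sequence log-probabilities would come from the two models.
kl = reverse_kl([0.5, 0.5], [0.9, 0.1])
loss = preference_loss(-1.0, -2.0, -3.0, -1.0, beta=1.0)
```

Under these assumptions the loss shrinks as the student assigns relatively more probability to the teacher's output than to its own, which matches the second stage described in the abstract.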
