On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Abstract

Knowledge distillation (KD) is widely used to compress a teacher model, reducing its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from a distribution mismatch between the output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of relying solely on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and for task-agnostic distillation for instruction tuning.
