DDK: Distilling Domain Knowledge for Efficient Large Language Models
July 23, 2024
Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
cs.AI
Abstract
Despite the advanced intelligence abilities of large language models (LLMs)
in various applications, they still face significant computational and storage
demands. Knowledge Distillation (KD) has emerged as an effective strategy to
improve the performance of a smaller LLM (i.e., the student model) by
transferring knowledge from a high-performing LLM (i.e., the teacher model).
Prevailing techniques in LLM distillation typically use a black-box model API
to generate high-quality pretraining and alignment datasets, or utilize white-box
distillation by altering the loss function to better transfer knowledge from
the teacher LLM. However, these methods ignore the knowledge differences
between the student and teacher LLMs across domains. This results in excessive
focus on domains with minimal performance gaps and insufficient attention to
domains with large gaps, reducing overall performance. In this paper, we
introduce a new LLM distillation framework called DDK, which dynamically
adjusts the composition of the distillation dataset in a smooth manner
according to the domain performance differences between the teacher and student
models, making the distillation process more stable and effective. Extensive
evaluations show that DDK significantly improves the performance of student
models, outperforming both continuously pretrained baselines and existing
knowledge distillation methods by a large margin.
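The abstract describes DDK's core idea at a high level: periodically re-weight the distillation data mixture toward domains where the student lags the teacher most, while smoothing the updates so the mixture (and training) stays stable. The snippet below is a minimal illustrative sketch of that idea under stated assumptions, not the authors' implementation; the function name, the softmax-over-gaps scoring, and the exponential-moving-average smoothing are all hypothetical choices for illustration.

import numpy as np

def domain_mixture_weights(teacher_loss, student_loss, prev_weights,
                           smoothing=0.9, temperature=1.0):
    # Hypothetical sketch: teacher_loss and student_loss are per-domain
    # validation losses; prev_weights is the mixture from the previous round.
    # Larger student-teacher gap -> that domain receives more distillation data.
    gaps = np.maximum(student_loss - teacher_loss, 0.0)
    target = np.exp(gaps / temperature)
    target = target / target.sum()
    # Smooth (EMA-style) update rather than jumping straight to the new target,
    # mirroring the "smooth adjustment" the abstract emphasizes for stability.
    new_weights = smoothing * prev_weights + (1.0 - smoothing) * target
    return new_weights / new_weights.sum()

# Example with three illustrative domains (e.g., code, math, general text):
teacher = np.array([1.2, 1.5, 1.1])
student = np.array([1.6, 2.4, 1.2])
prev = np.ones(3) / 3
print(domain_mixture_weights(teacher, student, prev))

In this toy example the second domain, where the student trails the teacher most, gradually receives a larger share of the distillation data; how DDK actually measures gaps and schedules the updates is specified in the paper itself.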