DDK: Distilling Domain Knowledge for Efficient Large Language Models
July 23, 2024
Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
cs.AI
Abstract
Despite the advanced intelligence abilities of large language models (LLMs)
in various applications, they still face significant computational and storage
demands. Knowledge Distillation (KD) has emerged as an effective strategy to
improve the performance of a smaller LLM (i.e., the student model) by
transferring knowledge from a high-performing LLM (i.e., the teacher model).
Prevailing techniques in LLM distillation typically use a black-box model API
to generate high-quality pretraining and alignment datasets, or utilize white-box
distillation by altering the loss function to better transfer knowledge from
the teacher LLM. However, these methods ignore the knowledge differences
between the student and teacher LLMs across domains. This results in excessive
focus on domains with minimal performance gaps and insufficient attention to
domains with large gaps, reducing overall performance. In this paper, we
introduce a new LLM distillation framework called DDK, which dynamically
adjusts the composition of the distillation dataset in a smooth manner
according to the domain performance differences between the teacher and student
models, making the distillation process more stable and effective. Extensive
evaluations show that DDK significantly improves the performance of student
models, outperforming both continuously pretrained baselines and existing
knowledge distillation methods by a large margin.
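The abstract describes DDK's core idea at a high level: periodically re-weight the distillation data mixture toward domains where the student lags the teacher most, while smoothing the updates so the mixture (and training) stays stable. The snippet below is a minimal illustrative sketch of that idea under stated assumptions, not the authors' implementation; the function name, the softmax-over-gaps scoring, and the exponential-moving-average smoothing are all hypothetical choices for illustration.

import numpy as np

def domain_mixture_weights(teacher_loss, student_loss, prev_weights,
                           smoothing=0.9, temperature=1.0):
    # Hypothetical sketch: teacher_loss and student_loss are per-domain
    # validation losses; prev_weights is the mixture from the previous round.
    # Larger student-teacher gap -> that domain receives more distillation data.
    gaps = np.maximum(student_loss - teacher_loss, 0.0)
    target = np.exp(gaps / temperature)
    target = target / target.sum()
    # Smooth (EMA-style) update rather than jumping straight to the new target,
    # mirroring the "smooth adjustment" the abstract emphasizes for stability.
    new_weights = smoothing * prev_weights + (1.0 - smoothing) * target
    return new_weights / new_weights.sum()

# Example with three illustrative domains (e.g., code, math, general text):
teacher = np.array([1.2, 1.5, 1.1])
student = np.array([1.6, 2.4, 1.2])
prev = np.ones(3) / 3
print(domain_mixture_weights(teacher, student, prev))

In this toy example the second domain, where the student trails the teacher most, gradually receives a larger share of the distillation data; how DDK actually measures gaps and schedules the updates is specified in the paper itself.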