

Pre-training Distillation for Large Language Models: A Design Space Exploration

October 21, 2024
Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
cs.AI

Abstract

Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD to large language models (LLMs) has typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, which we term pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B-parameter student LLM, validating the effectiveness of PD. Considering the key factors that influence distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline vs. online logits. Through extensive experiments over this design space, we identify better configurations and draw interesting conclusions, such as that larger student LLMs generally benefit more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practice in pre-training distillation.
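
For intuition, below is a minimal sketch of what a logits-based pre-training distillation objective could look like, assuming the student is trained with the standard next-token cross-entropy plus a KL-divergence term against the teacher's temperature-softened logits. The function name and the `alpha` and `temperature` hyper-parameters are illustrative assumptions, not interfaces or values from the paper.

```python
import torch.nn.functional as F


def pretraining_distillation_loss(student_logits, teacher_logits, target_ids,
                                  alpha=0.5, temperature=2.0):
    """Combine the language-modeling loss with a logits-distillation loss.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    target_ids: (batch, seq_len) ground-truth next tokens
    alpha, temperature: illustrative hyper-parameters, not taken from the paper.
    """
    vocab_size = student_logits.size(-1)

    # Standard next-token cross-entropy on the pre-training corpus.
    lm_loss = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                              target_ids.reshape(-1))

    # KL divergence between temperature-softened teacher and student distributions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    kd_loss = kd_loss * temperature ** 2  # usual rescaling for softened targets

    return alpha * lm_loss + (1.0 - alpha) * kd_loss
```

In an offline setting the teacher logits would be precomputed and stored for the pre-training corpus, whereas an online setting would query the teacher during training; the loss form above is the same in both cases.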
