

Hybrid Policy Distillation for LLMs

April 22, 2026
Authors: Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
cs.AI

Abstract
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
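The abstract describes combining forward KL (which encourages the student to cover all of the teacher's modes) with reverse KL (which encourages the student to concentrate on high-probability modes). A minimal sketch of such a hybrid divergence over a single token distribution is shown below; the interpolation weight `alpha` and the function names are illustrative assumptions, not the paper's actual formulation.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions given as probability lists.
    # Terms with p_i == 0 contribute zero by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kl(teacher, student, alpha=0.5):
    # Convex mixture of forward KL (mode-covering, teacher || student)
    # and reverse KL (mode-seeking, student || teacher).
    # alpha is a hypothetical knob for illustration only.
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)

# Example next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
loss = hybrid_kl(teacher, student, alpha=0.5)
```

At `alpha=1.0` this reduces to pure forward KL and at `alpha=0.0` to pure reverse KL, so a schedule over `alpha` can trade coverage against mode-seeking during training.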