大型语言模型的混合策略精馏
Hybrid Policy Distillation for LLMs
April 22, 2026
Authors: Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
cs.AI
Abstract
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
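The abstract describes combining the mode-covering forward KL with the mode-seeking reverse KL at the token level. As a minimal illustration of that idea only (the exact HPD objective and its reweighting scheme are not specified in this abstract), the sketch below interpolates the two divergences over a single token's vocabulary distribution; the function name, the `alpha` mixing weight, and the toy distributions are all assumptions for exposition.

```python
import math

def hybrid_kl(p_teacher, q_student, alpha=0.5):
    """Illustrative hybrid divergence for one token position:
    alpha * FKL(p||q) + (1 - alpha) * RKL(q||p).

    `p_teacher` and `q_student` are probability distributions over a
    (toy) vocabulary, assumed strictly positive where the other is
    nonzero. `alpha` interpolates between mode-covering forward KL
    (alpha=1) and mode-seeking reverse KL (alpha=0). This is a sketch
    of the general idea, not the paper's HPD objective.
    """
    fkl = sum(p * math.log(p / q)
              for p, q in zip(p_teacher, q_student) if p > 0)
    rkl = sum(q * math.log(q / p)
              for p, q in zip(p_teacher, q_student) if q > 0)
    return alpha * fkl + (1 - alpha) * rkl

# Toy 3-token vocabulary: the student under-weights the teacher's mode.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
print(round(hybrid_kl(teacher, student, alpha=0.5), 4))  # → 0.0886
```

In practice such a loss would be computed from model logits with numerically stable log-softmax operations and averaged over token positions; the pure-Python form above is only meant to make the forward/reverse trade-off concrete.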