Trust Region On-Policy Distillation

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization to fail. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). Its core features are: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: for outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: the student continues generation from teacher prefixes and uses forward KL to imitate the off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
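
To make the trust-region idea concrete, the following is a minimal sketch of per-token credit assignment and not the method as specified in this work. It assumes the K1 reverse-KL estimate for a student-sampled token is the log-ratio log π_student(y_t) − log π_teacher(y_t), and that a token counts as "reliable" when the magnitude of this log-ratio is at most a threshold. The threshold criterion, the names `trust_region_signal`, `delta`, and `outlier_mode`, and the example log-probabilities are all hypothetical and chosen only for illustration. Outlier tokens are either masked or clipped; forward-KL handling of outliers and off-policy guidance are not shown.

```python
def k1_reverse_kl(student_logps, teacher_logps):
    """Per-token K1 estimate of the reverse KL on student-sampled tokens:
    log pi_student(y_t) - log pi_teacher(y_t)."""
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def trust_region_signal(student_logps, teacher_logps, delta=2.0, outlier_mode="mask"):
    """Return a per-token distillation signal restricted to a trust region.

    Assumption (not from the source): a token is inside the trust region
    when |log-ratio| <= delta. Tokens outside it are treated as outliers
    and are either masked (signal set to 0) or clipped to [-delta, delta].
    """
    signal = []
    for value in k1_reverse_kl(student_logps, teacher_logps):
        if abs(value) <= delta:
            # Reliable region: keep the plain K1 signal.
            signal.append(value)
        elif outlier_mode == "mask":
            # Outlier: drop its contribution entirely.
            signal.append(0.0)
        elif outlier_mode == "clip":
            # Outlier: bound its contribution.
            signal.append(max(-delta, min(delta, value)))
        else:
            raise ValueError(f"unknown outlier_mode: {outlier_mode!r}")
    return signal


# Hypothetical log-probabilities of three student-sampled tokens.
student_logps = [-1.0, -0.5, -0.2]
teacher_logps = [-1.2, -6.0, -0.1]  # the teacher finds the second token very unlikely

masked = trust_region_signal(student_logps, teacher_logps, delta=2.0, outlier_mode="mask")
clipped = trust_region_signal(student_logps, teacher_logps, delta=2.0, outlier_mode="clip")
```

In this toy example the second token has a log-ratio of 5.5, which exceeds the assumed threshold, so masking removes its contribution and clipping bounds it at 2.0, while the other two tokens keep their unmodified K1 signal.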