Trust Region On-Policy Distillation

May 31, 2026
Authors: Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang
cs.AI

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate this off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
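
To make the first two components concrete, here is a minimal Python sketch of a per-token K1 reverse-KL signal with a trust-region filter. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the trust region is defined, so the absolute log-ratio threshold `delta`, the function names, and the `mode` switch are all hypothetical. The `"clip"` branch clips the per-token signal as a simplified stand-in for the gradient clipping mentioned above, and the forward-KL option is omitted because it needs full token distributions.

```python
from typing import List, Tuple


def k1_reverse_kl(student_logps: List[float], teacher_logps: List[float]) -> List[float]:
    """Per-token K1 estimate of KL(student || teacher).

    Both inputs are log-probabilities of tokens sampled from the student,
    so the K1 estimate is simply log p_student - log p_teacher.
    """
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def trust_region_advantages(
    student_logps: List[float],
    teacher_logps: List[float],
    delta: float = 2.0,  # hypothetical threshold, not taken from the paper
    mode: str = "mask",  # "mask" or "clip" (illustrative outlier handling)
) -> Tuple[List[float], List[bool]]:
    """Return per-token advantages and a flag marking tokens inside the region.

    Assumption: a token is "trusted" when |log p_student - log p_teacher| <= delta.
    Trusted tokens get the usual OPD signal (negative K1 estimate).
    Outlier tokens are either zeroed ("mask") or have their signal clipped ("clip").
    """
    if mode not in ("mask", "clip"):
        raise ValueError(f"unknown mode: {mode}")

    advantages: List[float] = []
    in_region: List[bool] = []
    for k in k1_reverse_kl(student_logps, teacher_logps):
        trusted = abs(k) <= delta
        in_region.append(trusted)
        if trusted:
            advantages.append(-k)
        elif mode == "mask":
            advantages.append(0.0)
        else:
            # Clip the log-ratio to [-delta, delta] before negating it.
            advantages.append(-max(-delta, min(delta, k)))
    return advantages, in_region


# Example: the second token is one the teacher finds very unlikely.
student = [-0.5, -1.0, -0.25]
teacher = [-0.75, -6.0, -0.125]
masked, flags = trust_region_advantages(student, teacher, delta=2.0, mode="mask")
clipped, _ = trust_region_advantages(student, teacher, delta=2.0, mode="clip")
print(masked, flags, clipped)
```

In this toy example the second token has a log-ratio of 5.0, well outside the assumed region, so masking drops its contribution entirely while clipping caps its magnitude at `delta`. Which of these behaves better in practice is an empirical question that the sketch does not answer.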