Trust Region On-Policy Distillation

May 31, 2026
Authors: Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang
cs.AI

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, because teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses the problem of reliable token-level on-policy supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate the off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
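
The abstract does not say how the trust region is defined or how outlier tokens are detected, so the following is only an illustrative sketch, not the authors' implementation. It assumes a hypothetical threshold `delta` on the absolute per-token log-ratio between student and teacher, and shows two of the outlier treatments named above (masking and clipping) applied to the standard K1 reverse-KL estimate, `log p_student(x) - log p_teacher(x)`, evaluated on student-sampled tokens. The function names, the threshold rule, and the example probabilities are all assumptions made for illustration; the forward-KL variant is omitted because it needs full vocabulary distributions.

```python
import math


def k1_reverse_kl(logp_student, logp_teacher):
    """Per-token K1 estimate of KL(student || teacher).

    Both inputs hold log-probabilities of the tokens the student sampled.
    """
    return [s - t for s, t in zip(logp_student, logp_teacher)]


def trust_region_signal(logp_student, logp_teacher, delta=2.0, outlier="mask"):
    """Keep the K1 signal inside an assumed trust region; damp it outside.

    ASSUMPTION: a token is "in region" when |log-ratio| <= delta. The paper's
    actual criterion is not given in the abstract.
    """
    k1 = k1_reverse_kl(logp_student, logp_teacher)
    in_region = [abs(v) <= delta for v in k1]
    signal = []
    for value, ok in zip(k1, in_region):
        if ok:
            signal.append(value)  # reliable supervision: use as-is
        elif outlier == "mask":
            signal.append(0.0)  # drop the token's contribution
        elif outlier == "clip":
            signal.append(max(-delta, min(delta, value)))  # bound its magnitude
        else:
            raise ValueError(f"unknown outlier mode: {outlier!r}")
    return signal, in_region


# Hypothetical per-token probabilities for three student-sampled tokens.
# The teacher assigns the third token a very low probability.
logp_student = [math.log(0.5), math.log(0.4), math.log(0.6)]
logp_teacher = [math.log(0.4), math.log(0.3), math.log(0.001)]

masked, region = trust_region_signal(logp_student, logp_teacher, outlier="mask")
clipped, _ = trust_region_signal(logp_student, logp_teacher, outlier="clip")
print(region)
print(masked)
print(clipped)
```

In this toy input the third token has a log-ratio of about 6.4, well outside the assumed threshold of 2.0, so masking removes its contribution and clipping caps it at `delta`. In a policy-gradient setup the negated signal would typically act as the per-token reward; that wiring is left out here.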