Trust Region On-Policy Distillation

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens can yield unreliable policy gradients and even cause optimization to fail. This work addresses the problem of reliable token-level on-policy supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components: 1) Trust-region on-policy learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, which mitigates the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier estimation: for tokens in outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-policy guidance: the student continues generation from teacher prefixes and uses forward KL to imitate the off-policy guidance, which encourages on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, on mathematical reasoning, code generation, and general-domain benchmarks.
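
To make the trust-region idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-token K1 reverse-KL estimate on a student-sampled token is log p_student(x) - log p_teacher(x), and that a token counts as "reliable" when the absolute value of that estimate is at most a threshold. The abstract does not state the actual reliability criterion, so the threshold rule, the function names, and the `mask` and `clip` modes are all illustrative assumptions.

```python
def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of KL(student || teacher) on student-sampled tokens.

    Each entry is log p_student(x_t) - log p_teacher(x_t).
    """
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def trust_region_weights(k1, threshold=2.0, mode="mask"):
    """Hypothetical trust-region rule (an assumption, not taken from the paper).

    Tokens with |k1| <= threshold are treated as reliable and keep weight 1.
    Outlier tokens are either dropped ("mask") or rescaled so that their
    signal has magnitude equal to the threshold ("clip").
    """
    weights = []
    for value in k1:
        if abs(value) <= threshold:
            weights.append(1.0)
        elif mode == "mask":
            weights.append(0.0)
        elif mode == "clip":
            weights.append(threshold / abs(value))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return weights


def trust_region_signal(student_logp, teacher_logp, threshold=2.0, mode="mask"):
    """Per-token distillation signal after applying the trust-region weights."""
    k1 = k1_reverse_kl(student_logp, teacher_logp)
    weights = trust_region_weights(k1, threshold, mode)
    return [w * v for w, v in zip(weights, k1)]


if __name__ == "__main__":
    # Made-up log-probabilities for three student-sampled tokens.
    student_logp = [-0.5, -1.0, -0.2]
    teacher_logp = [-0.7, -6.0, -0.1]  # the teacher strongly disagrees on token 2
    print(trust_region_signal(student_logp, teacher_logp, mode="mask"))
    print(trust_region_signal(student_logp, teacher_logp, mode="clip"))
```

In this toy input the second token, where the teacher assigns a far lower log-probability than the student, falls outside the assumed trust region. Its signal is zeroed under `mask` and capped at the threshold under `clip`, while the other two tokens pass through unchanged. The forward-KL treatment of outliers and the off-policy guidance from teacher prefixes described in the abstract are not covered by this sketch.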