Trust Region On-Policy Distillation

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens can yield unreliable policy gradients and even cause optimization failure. This work uses credit assignment strategies to make on-policy token-level supervision reliable, and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components. 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: for outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: the student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
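
As a rough illustration of the first two components, the sketch below computes the per-token K1 reverse-KL estimate, log π_student(x) − log π_teacher(x), on student-sampled tokens and uses its negative as the token-level signal only where the estimate stays within a threshold. The abstract does not specify how the trust region is defined, so the criterion |K1| ≤ delta, the function name `token_advantages`, the parameter `delta`, and the use of value clipping as a stand-in for gradient clipping are all assumptions made for illustration, not the method of the paper. The forward-KL option requires the teacher's full next-token distribution and is not sketched here.

```python
def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of KL(student || teacher).

    Both inputs are log-probabilities of the tokens the student sampled.
    """
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def token_advantages(student_logp, teacher_logp, delta=2.0, outlier="mask"):
    """Hypothetical trust-region filter on the token-level OPD signal.

    Inside the assumed trust region (|K1| <= delta) the signal is -K1.
    Outside it, the token is either masked (signal 0) or its signal is
    clipped to [-delta, delta]. Both rules are illustrative assumptions.
    """
    if outlier not in ("mask", "clip"):
        raise ValueError("outlier must be 'mask' or 'clip'")
    advantages = []
    for k in k1_reverse_kl(student_logp, teacher_logp):
        if abs(k) <= delta:
            advantages.append(-k)  # reliable region: plain OPD signal
        elif outlier == "mask":
            advantages.append(0.0)  # drop the unreliable token
        else:
            advantages.append(max(-delta, min(delta, -k)))  # bound the signal
    return advantages


# Toy example: the second token is one the teacher finds very unlikely.
student = [-0.5, -1.0, -0.2]
teacher = [-0.7, -6.0, -0.1]
masked = token_advantages(student, teacher, delta=2.0, outlier="mask")
clipped = token_advantages(student, teacher, delta=2.0, outlier="clip")
```

In this toy example the second token has a K1 value of 5.0, so it falls outside the assumed region: masking removes its contribution entirely, while clipping keeps a bounded penalty of magnitude `delta`.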