Trust-Region Behavior Blending for On-Policy Distillation
May 29, 2026
Authors: Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov
cs.AI
Abstract
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
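The abstract does not spell out the trust-region problem, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes a single next-token distribution, that "closest to teacher" means minimizing KL(b || teacher), and that the trust region is KL(b || student) <= budget. Under those assumptions the solution lies on the geometric path between student and teacher, so a bisection on the blend weight is enough. The function names, the KL directions, and the linear annealing schedule are all hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_blend(student, teacher, lam):
    """Normalized student^(1 - lam) * teacher^lam (assumed blending form)."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student, teacher, budget, iters=50):
    """Blend nearest the teacher with KL(blend || student) <= budget.

    KL(blend || student) grows monotonically with lam along the geometric
    path, so bisection on lam finds the largest feasible blend weight.
    """
    if budget <= 0.0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def annealed_budget(step, warmup_steps, initial_budget):
    """Linear decay of the KL budget to zero (schedule shape is an assumption)."""
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return initial_budget * (1.0 - frac)


student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
behavior = trust_region_behavior(student, teacher, annealed_budget(0, 100, 0.1))
```

In this sketch, rollout tokens during warmup would be sampled from `behavior` instead of `student`, while the reverse-KL loss on each prefix stays as in plain OPD. Once `annealed_budget` reaches zero, `trust_region_behavior` returns the student distribution unchanged.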