Trust-Region Behavior Blending for On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching the behavior of a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, so teacher supervision is spent on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is gradually annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average performance among the compared methods.
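
To make the trust-region idea concrete, the following is a minimal single-token sketch, not the method's actual implementation. It rests on several assumptions that the abstract does not confirm: the trust region is taken to be KL(behavior || student) <= budget, closeness to the teacher is measured by KL(behavior || teacher), and the budget decays linearly. Under those assumptions the constrained optimum lies on the geometric path between student and teacher, so a bisection over the mixing weight is enough. The names `trust_region_behavior`, `geometric_mix`, and `linear_budget` are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def geometric_mix(student, teacher, lam):
    """Normalized geometric interpolation student^(1-lam) * teacher^lam."""
    weights = [s ** (1 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student, teacher, budget, iters=50):
    """Assumed form: closest-to-teacher policy with KL(b || student) <= budget."""
    if budget <= 0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # teacher itself already lies inside the region
    lo, hi = 0.0, 1.0
    # KL(mix(lam) || student) is nondecreasing in lam, so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(geometric_mix(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_mix(student, teacher, lo)


def linear_budget(step, warmup_steps, initial_budget):
    """Assumed schedule: linear decay to zero at the end of warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


# Toy next-token distributions over a three-token vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
for step in (0, 50, 100):
    eps = linear_budget(step, warmup_steps=100, initial_budget=0.2)
    behavior = trust_region_behavior(student, teacher, eps)
    print(step, round(eps, 3), [round(b, 3) for b in behavior])
```

In this sketch only the sampling distribution changes during warmup; the loss evaluated on each sampled prefix would remain the usual reverse-KL OPD loss, and once the budget reaches zero the behavior policy coincides with the student.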