Trust-Region Behavior Blending for On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, so teacher supervision is applied to low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average performance among the compared methods.
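To make the trust-region idea concrete, the following is a minimal sketch for a single next-token distribution, not the paper's implementation. The abstract does not state the KL directions, the solver, or the annealing schedule, so everything here is an assumption: the behavior policy b minimizes KL(b || teacher) subject to KL(b || student) <= budget, which places the solution on the geometric path between student and teacher; the constraint is met by bisection; and the budget decays linearly. The function names `geometric_mix`, `blended_behavior`, and `kl_budget` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def geometric_mix(student, teacher, lam):
    """Normalized geometric interpolation student^(1-lam) * teacher^lam."""
    logits = (1.0 - lam) * np.log(student) + lam * np.log(teacher)
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def blended_behavior(student, teacher, budget, iters=50):
    """Assumed reading: the point on the geometric path closest to the
    teacher that still satisfies KL(b || student) <= budget."""
    if budget <= 0.0:
        return student.copy()   # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return teacher.copy()   # the teacher itself lies inside the region
    lo, hi = 0.0, 1.0
    # KL(b_lam || student) is nondecreasing in lam, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_mix(student, teacher, lo)

def kl_budget(step, warmup_steps, initial_budget):
    """Assumed linear schedule: the budget reaches zero at warmup_steps."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)

student = np.array([0.7, 0.2, 0.1])
teacher = np.array([0.1, 0.3, 0.6])
behavior = blended_behavior(student, teacher, kl_budget(0, 100, 0.05))
```

In a full training loop, rollouts would be sampled token by token from `behavior` during warmup, while the per-prefix reverse-KL OPD loss would still be computed between the student and the teacher. Once `kl_budget` reaches zero, `blended_behavior` returns the student distribution, which matches the abstract's statement that training returns to pure student rollouts.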