Trust-Region Behavior Blending for On-Policy Distillation

May 29, 2026
Authors: Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov
cs.AI

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, so teacher supervision lands on weak or low-quality prefixes. We propose Trust-Region Behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average performance among the compared methods.
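
To make the trust-region idea concrete, here is a minimal per-token sketch in plain Python. It is not the authors' implementation. The abstract does not state the KL direction of the trust region, the family of behavior policies, or the annealing schedule, so the sketch assumes all three: the constraint is KL(b || student) <= budget, the behavior policy is a normalized geometric mixture of the student and teacher next-token distributions (the form that minimizes KL(b || teacher) under that constraint), and the budget decays linearly. The names `geometric_blend`, `trust_region_behavior`, and `kl_budget` are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Assumes q is strictly positive wherever p is (e.g. softmax outputs).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_blend(student, teacher, lam):
    """Normalized geometric mixture b proportional to student^(1-lam) * teacher^lam."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student, teacher, budget, iters=50):
    """Blend closest to the teacher with KL(b || student) <= budget (assumed form).

    Uses bisection on the mixing weight lam in [0, 1]; KL(b_lam || student)
    grows with lam along the geometric path, so the largest feasible lam
    gives the blend closest to the teacher.
    """
    if budget <= 0.0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0  # lo is always feasible, hi is always infeasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def kl_budget(step, warmup_steps, initial_budget):
    """Linear anneal of the KL budget to zero over warmup (assumed schedule)."""
    if step >= warmup_steps:
        return 0.0
    return initial_budget * (1.0 - step / warmup_steps)


# Toy next-token distributions over a 3-token vocabulary (illustrative values).
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
behavior = trust_region_behavior(student, teacher, kl_budget(0, 100, 0.05))
```

Under these assumptions, rollouts would be sampled token by token from `behavior` during warmup, while the loss on each sampled prefix stays the usual reverse-KL OPD loss between student and teacher. Once `kl_budget` reaches zero, `trust_region_behavior` returns the student distribution, which matches the abstract's statement that training returns to pure student rollouts.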