SG-OPD: Sign-Gated On-Policy Distillation with Sign-Consistency Gating and Phased Teacher Sampling

Abstract

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and it often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that are frequently violated in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities. At the trajectory level, phased teacher sampling mixes verifier-endorsed teacher rollouts into training during the cold-start phase. At the token level, a sign-consistency gate extrapolates the distillation update on tokens where the teacher's preference agrees with the verifier's correction direction and interpolates it on tokens where the two disagree. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
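The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `phased_batch` and `sign_gated_signal`, the parameters `warmup_steps`, `teacher_fraction`, `alpha`, and `beta`, the use of the teacher-minus-student log-probability gap as the per-token distillation signal, and the reading of "extrapolate" as scaling that signal above 1 and "interpolate" as scaling it below 1.

```python
from typing import Callable, List, Sequence


def phased_batch(
    step: int,
    student_rollouts: Sequence[str],
    teacher_rollouts: Sequence[str],
    verify: Callable[[str], bool],
    warmup_steps: int = 100,        # assumed length of the cold-start phase
    teacher_fraction: float = 0.5,  # assumed share of teacher rollouts during cold start
) -> List[str]:
    """Hypothetical phased teacher sampling.

    During cold start, replace up to `teacher_fraction` of the batch with
    teacher rollouts that the binary verifier accepts. Afterwards, train
    purely on the student's own rollouts.
    """
    if step >= warmup_steps:
        return list(student_rollouts)
    endorsed = [r for r in teacher_rollouts if verify(r)]
    k = min(len(endorsed), int(teacher_fraction * len(student_rollouts)))
    return endorsed[:k] + list(student_rollouts[k:])


def sign_gated_signal(
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    verified_correct: bool,
    alpha: float = 0.5,  # assumed extrapolation strength
    beta: float = 0.5,   # assumed interpolation strength, in [0, 1]
) -> List[float]:
    """Hypothetical sign-consistency gate over per-token signals.

    The verifier direction is +1 for a rollout it accepts and -1 otherwise.
    Tokens whose teacher signal has the same sign as that direction are
    scaled by (1 + alpha); all other tokens are scaled by (1 - beta).
    """
    direction = 1.0 if verified_correct else -1.0
    gated = []
    for t_lp, s_lp in zip(teacher_logp, student_logp):
        delta = t_lp - s_lp  # assumed per-token teacher preference
        agrees = delta * direction > 0
        scale = 1.0 + alpha if agrees else 1.0 - beta
        gated.append(scale * delta)
    return gated
```

In this sketch, a verifier-accepted rollout amplifies tokens the teacher favors and dampens tokens the teacher disfavors, while a rejected rollout does the opposite. The actual gating rule and sampling schedule in SG-OPD may differ.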