SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Abstract

On-policy distillation (OPD) trains a student on its own trajectories using dense token-level supervision from a stronger teacher, and it often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities. At the trajectory level, phased teacher sampling mixes verifier-endorsed teacher rollouts into training during cold start. At the token level, a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it on tokens where the teacher disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 at the per-sample level and 7.50 at the per-question level.
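To make the two granularities concrete, the following is a minimal sketch of how phased teacher sampling and a sign-consistency gate might be written. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the schedule, the gate coefficients, or the exact form of the per-token signal. Here we assume the per-token teacher preference is the log-probability gap between teacher and student, that the verifier-correct direction is positive for a verified-correct rollout and negative otherwise, and that the mixing probability decays linearly. The names `teacher_mix_prob`, `sign_gated_update`, `p0`, `extrapolate`, and `interpolate` are hypothetical.

```python
from typing import List


def teacher_mix_prob(step: int, warmup_steps: int, p0: float = 0.5) -> float:
    """Hypothetical cold-start schedule (assumption, not from the paper).

    Returns the probability of drawing a verifier-endorsed teacher rollout
    instead of a student rollout; decays linearly from p0 to 0.
    """
    if warmup_steps <= 0:
        return 0.0
    return p0 * max(0.0, 1.0 - step / warmup_steps)


def sign_gated_update(
    teacher_logp: List[float],
    student_logp: List[float],
    verifier_ok: bool,
    extrapolate: float = 1.5,  # assumed scale > 1 when teacher and verifier agree
    interpolate: float = 0.5,  # assumed scale in (0, 1) when they disagree
) -> List[float]:
    """Scale each token's distillation signal by a sign-consistency gate.

    The per-token signal is assumed to be teacher_logp - student_logp.
    The verifier-correct direction is assumed to be +1 for a rollout the
    verifier accepts and -1 for one it rejects.
    """
    direction = 1.0 if verifier_ok else -1.0
    gated = []
    for t, s in zip(teacher_logp, student_logp):
        signal = t - s
        # Teacher agrees when its preference points in the verifier's direction.
        agrees = signal * direction > 0
        scale = extrapolate if agrees else interpolate
        gated.append(scale * signal)
    return gated


if __name__ == "__main__":
    # Two tokens: the teacher prefers the first and disprefers the second.
    print(sign_gated_update([-1.0, -3.0], [-2.0, -2.0], verifier_ok=True))
    print(teacher_mix_prob(step=50, warmup_steps=100))
```

In this sketch the gate only rescales the magnitude of the per-token signal and never flips its sign, so the teacher's preference is amplified where the verifier corroborates it and damped where it does not.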