SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Abstract

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities. Phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold start. A sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it on tokens where the teacher disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
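
The abstract does not give the gate's formula, so the following is only an illustrative sketch of one possible reading of the sign-consistency gate, not the authors' implementation. It assumes the teacher's per-token preference is the log-probability gap between teacher and student, that the binary verifier contributes a sign of +1 (correct rollout) or -1 (incorrect rollout), and that "extrapolate" and "interpolate" mean scaling the per-token signal by a factor above or below 1. The function name `sign_gated_weights` and the parameters `alpha_extrapolate` and `alpha_interpolate` are hypothetical.

```python
from typing import List


def sign_gated_weights(
    teacher_logp: List[float],
    student_logp: List[float],
    verifier_correct: bool,
    alpha_extrapolate: float = 1.5,  # hypothetical scale > 1 (extrapolation)
    alpha_interpolate: float = 0.5,  # hypothetical scale < 1 (interpolation)
) -> List[float]:
    """Return a gated per-token distillation signal for one student rollout.

    Assumption: the teacher preference for a sampled token is
    teacher_logp - student_logp, and the verifier sign is +1 for a
    correct rollout and -1 for an incorrect one.
    """
    verifier_sign = 1.0 if verifier_correct else -1.0
    gated = []
    for t_lp, s_lp in zip(teacher_logp, student_logp):
        preference = t_lp - s_lp  # > 0: teacher favors the token more than the student
        # The teacher "agrees" with the verifier when both signs match.
        agrees = preference * verifier_sign > 0
        scale = alpha_extrapolate if agrees else alpha_interpolate
        gated.append(scale * preference)
    return gated


if __name__ == "__main__":
    # Two tokens from a rollout that the verifier marked as correct.
    print(sign_gated_weights([-1.0, -3.0], [-2.0, -1.0], verifier_correct=True))
```

Under these assumptions, a token the teacher favors in a verified-correct rollout has its signal amplified, while a token the teacher disfavors in that same rollout has its signal damped rather than discarded.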