Trust-Region Behavior Blending for On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, so teacher supervision lands on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the behavior policy closest to the teacher inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average performance among the compared methods.
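
The trust-region blend can be illustrated on a single next-token distribution. The sketch below is not the authors' implementation and rests on several assumptions the abstract does not state: "closest to the teacher" is taken to mean minimizing KL(b || teacher), the trust region is KL(b || student) <= budget, both distributions are strictly positive, and the budget decays linearly. Under those assumptions the constrained minimizer is a normalized geometric mixture of student and teacher, and the mixing weight can be found by bisection. All function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_mix(student, teacher, lam):
    """Normalized geometric interpolation: b proportional to student^(1-lam) * teacher^lam."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def blended_behavior(student, teacher, budget, iters=50):
    """Behavior policy closest to the teacher with KL(b || student) <= budget.

    Assumption: closeness is measured by KL(b || teacher). KL(b_lam || student)
    is non-decreasing in lam on [0, 1], so bisection on lam is valid.
    """
    if budget <= 0.0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_mix(student, teacher, lo)


def annealed_budget(step, warmup_steps, initial_budget):
    """Assumed linear decay of the KL budget to zero over the warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]
    teacher = [0.1, 0.2, 0.7]
    for step in (0, 50, 100):
        eps = annealed_budget(step, warmup_steps=100, initial_budget=0.2)
        b = blended_behavior(student, teacher, eps)
        print(step, round(eps, 3), [round(x, 3) for x in b])
```

Once the budget reaches zero the function returns the student distribution unchanged, which mirrors the return to pure student rollouts after warmup. A sequence-level or per-prefix budget, or a different divergence for closeness, would change the form of the solution.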