Self-Distilled Policy Gradient

Abstract

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In practice, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler (KL) divergence loss. We propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR (reinforcement learning with verifiable rewards) and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
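
To make the three ingredients concrete, the following is a minimal pure-Python sketch of how they might fit together at a single token position. It is not taken from the SDPG repository: the function names, the coefficients `distill_coef` and `ref_kl_coef`, the additive combination of the terms, and the use of a full-vocabulary KL for the reference-policy term are all illustrative assumptions. The teacher logits are assumed to come from the same model conditioned on privileged context and are treated here as fixed inputs.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards within one group and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """Exact full-vocabulary KL(student || teacher) at one token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sdpg_style_token_loss(advantage, token_id, student_logits, teacher_logits,
                          ref_logits, distill_coef=1.0, ref_kl_coef=0.01):
    """Hypothetical combined loss; the weighting scheme is an assumption."""
    log_p = log_softmax(student_logits)
    # Policy-gradient term: raise the log-probability of tokens with positive advantage.
    pg_term = -advantage * log_p[token_id]
    # Self-distillation term: student-to-teacher reverse KL over the whole vocabulary.
    distill_term = reverse_kl(student_logits, teacher_logits)
    # Regularization term: keep the student close to a frozen reference policy.
    ref_term = reverse_kl(student_logits, ref_logits)
    return pg_term + distill_coef * distill_term + ref_kl_coef * ref_term
```

For example, `group_relative_advantages([1, 0, 0, 1])` yields values close to `[1, -1, -1, 1]`, and `reverse_kl` returns zero whenever the student and teacher logits coincide. In a real training loop the teacher and reference distributions would typically be excluded from gradient computation, which this scalar sketch does not model.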