Self-Distilled Policy Gradient

Abstract

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In practice, it can be instantiated as an auxiliary full-vocabulary, student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves both stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
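As a rough illustration of two ingredients named in the abstract, the following minimal sketch computes group-relative advantages normalized by the group standard deviation, and an exact full-vocabulary reverse KL divergence from a student distribution to a teacher distribution at a single token position. It is not taken from the SDPG repository. The function names, the `eps` constant, and the toy numbers are hypothetical, and the sketch assumes the student is the model without privileged context and the teacher is the same model conditioned on it. How the terms are weighted and combined in the full objective is not specified here, so the sketch leaves them separate.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards within one group and divide by the group std.

    `eps` is an assumed stabilizer for groups whose rewards are all equal.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher), summed over the full vocabulary."""
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy usage: four sampled responses to one prompt, with binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Toy usage: a three-token vocabulary at one position (hypothetical logits).
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
kl = reverse_kl(student, teacher)
```

Summing over the whole vocabulary, rather than evaluating only the sampled token, is what makes the divergence exact at each position. A practical implementation would operate on batched tensors of logits instead of Python lists.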