Self-Distilled Policy Gradient

June 2, 2026
Authors: Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu
cs.AI

Abstract

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In practice, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler (KL) divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines three components: group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over reinforcement learning with verifiable rewards (RLVR) and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
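
To make two of the ingredients named in the abstract concrete, the following is a minimal sketch in plain Python; it is not taken from the SDPG repository. The function names, the `eps` constant, and the toy logits are illustrative assumptions. It assumes that "group-relative advantages" means centering each verifier reward on its group mean and dividing by the group standard deviation, and that the student and teacher distributions come from the same model without and with the privileged context, respectively. How the terms are weighted in the full objective is not stated in the abstract, so the sketch leaves that out.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards on the group mean and divide by the group std.

    `eps` is an assumed small constant that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) at one token position.

    The sum runs over every vocabulary entry rather than over sampled
    tokens, which is what "full-vocabulary" is taken to mean here.
    """
    log_p = log_softmax(student_logits)  # student: no privileged context
    log_q = log_softmax(teacher_logits)  # teacher: same model, privileged context
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy group of four rollouts scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)  # approximately [1, -1, -1, 1]

# Toy three-token vocabulary at a single position.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
distill_loss = reverse_kl(student, teacher)  # non-negative; zero only if the distributions match
```

The same `reverse_kl` helper could in principle be pointed at a frozen reference policy to illustrate the reference-policy KL term, although the abstract does not say which KL direction or estimator SDPG uses for that regularizer.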