Self-Distilled Policy Gradient

Abstract

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In practice, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler (KL) divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
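
The three ingredients named in the abstract can be illustrated with a minimal per-token sketch. This is not the released SDPG implementation: the function names, the loss weights `distill_coef` and `ref_coef`, and the way the terms are summed are all hypothetical. It also assumes that "normalized standard deviation" means dividing centered group rewards by the group standard deviation, as is common for group-relative advantages, and that the reference-policy regularizer is a full-vocabulary KL from the student to the reference policy.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards within one prompt's group and divide by the group std.

    Assumption: this is the usual group-relative convention, not a detail
    confirmed by the abstract.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, summed over the full vocabulary."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sdpg_style_token_loss(sampled_logp, advantage, student_logits,
                          teacher_logits, ref_logits,
                          distill_coef=0.1, ref_coef=0.01):
    """Hypothetical combination of the three terms for a single token.

    sampled_logp: student log-probability of the token that was sampled.
    teacher_logits: logits of the same model conditioned on privileged context.
    ref_logits: logits of a frozen reference policy.
    The coefficients are illustrative placeholders, not values from the paper.
    """
    policy_gradient = -advantage * sampled_logp
    distill = reverse_kl(student_logits, teacher_logits)
    ref_penalty = reverse_kl(student_logits, ref_logits)
    return policy_gradient + distill_coef * distill + ref_coef * ref_penalty
```

In this sketch the verifier reward enters only through the scalar advantage, while the distillation term supplies a signal at every token position, which is the sense in which self-distillation is "dense". When every reward in a group is identical, the advantages are all zero and only the two KL terms remain.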