Self-Distilled RLVR

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. It selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, whereas reinforcement learning with verifiable rewards (RLVR) obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), in which the same model serves as both teacher and student, and the teacher receives additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher cause severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we use self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). RLSD thereby combines the strengths of RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
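The split between update magnitude and update direction can be illustrated with a small sketch. This is only one possible reading of the abstract, not the paper's actual formula: the function name `token_level_advantages`, the use of the absolute log-probability gap as the magnitude, and the mean-one normalization are all assumptions made for illustration.

```python
from typing import List


def token_level_advantages(
    student_logps: List[float],
    teacher_logps: List[float],
    sequence_advantage: float,
    eps: float = 1e-8,
) -> List[float]:
    """Spread one sequence-level advantage over the tokens of a response.

    Hypothetical rule (an assumption, not taken from the paper):
    the magnitude of each token update comes from the self-distillation gap,
    while its sign comes only from the verifiable-reward advantage.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("log-prob lists must have the same length")
    if not student_logps:
        return []

    # Token-level policy difference between the privileged teacher pass
    # and the plain student pass of the same model.
    gaps = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]

    # Normalize so the weights average to 1; fall back to uniform weights
    # when the two passes agree on every token.
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap < eps:
        weights = [1.0] * len(gaps)
    else:
        weights = [g / mean_gap for g in gaps]

    # Every token inherits the sign of the RLVR advantage;
    # the teacher gap only rescales it.
    return [sequence_advantage * w for w in weights]


# Toy log-probabilities for a three-token response (made-up numbers).
student = [-1.0, -2.0, -0.5]
teacher = [-1.0, -0.5, -0.4]
print(token_level_advantages(student, teacher, sequence_advantage=1.0))
print(token_level_advantages(student, teacher, sequence_advantage=-1.0))
```

In this sketch the privileged teacher can never flip the direction of an update: a response judged incorrect gets non-positive token advantages regardless of what the teacher prefers, which is the property the abstract attributes to keeping RLVR in charge of the direction.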
