ChatPaper.ai


Self-Distilled RLVR

April 3, 2026
作者: Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan
cs.AI

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
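The combination described in the abstract — reliable update *directions* from verifiable environment rewards (RLVR) and fine-grained per-token update *magnitudes* from self-distillation — can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the function name, the ±1 reward convention, and the use of the absolute log-probability gap between the privileged teacher and the student as the magnitude are all assumptions.

```python
def rlsd_token_weights(student_logprobs, teacher_logprobs, reward):
    """Per-token update weights for an RLSD-style update (illustrative sketch).

    direction -- from the verifiable outcome reward (RLVR):
                 +1 for a correct response, -1 for an incorrect one.
    magnitude -- from self-distillation: the absolute log-probability gap
                 between the privileged teacher and the student on each token.
    """
    direction = 1.0 if reward > 0 else -1.0
    return [direction * abs(t - s)
            for s, t in zip(student_logprobs, teacher_logprobs)]


# Toy example: two sampled tokens; the privileged teacher disagrees with the
# student only on the first token, so only that token gets a large update.
weights = rlsd_token_weights(
    student_logprobs=[-1.0, -2.0],
    teacher_logprobs=[-0.5, -2.0],
    reward=1,
)
print(weights)  # [0.5, 0.0]
```

Tokens where teacher and student already agree receive near-zero weight regardless of the reward, which is one way the dense self-distillation signal refines the otherwise uniform sparse-reward update.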
PDF · April 7, 2026