

Self-Distilled RLVR

April 3, 2026
作者: Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan
cs.AI

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher lead to severe information leakage and long-term training instability. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
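The division of labor the abstract describes can be sketched numerically: the RLVR outcome advantage supplies the update direction, while the per-token gap between the privileged teacher's and the student's log-probabilities supplies the update magnitude. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation; the function name `rlsd_token_weights` and the normalization scheme are hypothetical.

```python
import numpy as np

def rlsd_token_weights(student_logp, teacher_logp, reward, baseline):
    """Hypothetical sketch of RLSD-style per-token update weights.

    student_logp / teacher_logp: per-token log-probabilities of the sampled
    trajectory under the student policy and the privileged teacher (the same
    model conditioned on extra information such as a reference answer).
    reward: verifiable outcome reward for the whole trajectory (e.g. 1 if
    the answer is correct, 0 otherwise); baseline: a reward baseline.
    """
    # Direction comes from RLVR: the sign of the outcome advantage decides
    # whether tokens are reinforced or suppressed.
    advantage = reward - baseline
    # Magnitude comes from self-distillation: tokens where the privileged
    # teacher disagrees most with the student get the largest updates.
    gap = np.abs(np.asarray(teacher_logp) - np.asarray(student_logp))
    # Normalize gaps to [0, 1] so they scale the update but never flip
    # the RLVR-determined direction.
    scale = gap / (gap.max() + 1e-8)
    return advantage * scale
```

Because the outcome advantage multiplies every token uniformly in sign, a correct trajectory is reinforced everywhere, but most strongly at the tokens the privileged teacher would have generated differently.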