Self-Knowledge Distillation, Reinforcement Learning, Vision-Language Representation

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. It uses a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), in which the same model serves as both teacher and student, and the teacher receives additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher lead to severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we use self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This allows RLSD to harness the strengths of both RLVR and OPSD at once, achieving a higher convergence ceiling and superior training stability.
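The division of labor described in the abstract (update direction from the verifiable reward, update magnitude from the token-level teacher/student difference) can be illustrated with a minimal sketch. This is not the paper's formula: the function name `rlsd_token_weights`, the use of an absolute log-probability gap, and the mean normalization are all assumptions made for illustration only.

```python
def rlsd_token_weights(student_logprobs, teacher_logprobs, advantage, eps=1e-8):
    """Hypothetical per-token update weights for one sampled response.

    student_logprobs: log-probs of the sampled tokens under the student.
    teacher_logprobs: log-probs of the same tokens under the same model
        when it is also given privileged information (the "teacher").
    advantage: sequence-level scalar from the verifiable reward (RLVR).
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have equal length")

    # Assumed magnitude signal: how much teacher and student disagree per token.
    gaps = [abs(t - s) for s, t in zip(student_logprobs, teacher_logprobs)]
    mean_gap = sum(gaps) / len(gaps)

    # If the two policies agree everywhere, fall back to uniform weighting.
    if mean_gap < eps:
        return [advantage for _ in gaps]

    # Normalize so magnitudes average to about 1, then let the reward set the sign.
    return [advantage * (g / (mean_gap + eps)) for g in gaps]
```

In this sketch the teacher never changes the sign of an update: a response judged incorrect (negative advantage) is pushed down on every token, and the self-distillation gap only decides which tokens are pushed hardest.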

Self-Distilled RLVR
