Learning from Language Feedback via Variational Policy Distillation

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, which create severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by using language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies. In the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
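
To make the notion of dense, token-level supervision concrete, the following is a minimal sketch of a per-token distillation loss between a teacher's target distribution and a student's distribution. It is an illustration under stated assumptions, not the VPD objective: the abstract does not specify the divergence or its direction, so KL(teacher || student) is assumed here, and the function names (softmax, kl_divergence, token_level_distillation_loss) are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def token_level_distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one rollout.

    Both arguments are lists with one entry per generated token; each entry
    is a list of logits over the vocabulary. The KL direction is an
    assumption made for illustration only.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same tokens")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy rollout: 3 tokens, vocabulary of 4.
teacher = [[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 2.0, 0.1]]
student = [[0.5, 0.5, 0.5, 0.5], [0.1, 2.0, 0.1, 0.1], [2.0, 0.1, 0.1, 0.1]]
loss, per_token = token_level_distillation_loss(teacher, student)
```

In this toy rollout the student receives one loss value per token, which is zero where it already matches the teacher, whereas a verifiable outcome reward would yield a single scalar for the whole trajectory.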