Reinforcement Learning with Rich Feedback via Distributional DAgger

Abstract

Reasoning models have advanced rapidly, but the dominant recipe for reinforcement learning from verifiable rewards (RLVR) remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior methods for RL with self-distillation, whose objectives are based on reverse KL or Jensen-Shannon divergence, fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
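To make the contrast between the two families of objectives concrete, the following is a minimal sketch, not the DistIL implementation. It assumes a toy setting in which the expert and the student each expose a next-action distribution as a plain list of probabilities. The function names (`forward_cross_entropy`, `reverse_kl`, `dagger_style_loss`) and the callables `expert_fn` and `student_fn` are hypothetical and introduced only for illustration. The sketch covers only the per-state loss averaged over student-visited states; it does not model the sequence-level gradient described in the abstract.

```python
import math


def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """Forward cross-entropy: -sum_a p_E(a) * log p_S(a).

    The expectation is taken under the expert, so the student is pushed
    to put mass wherever the expert does.
    """
    return -sum(
        pe * math.log(ps + eps) for pe, ps in zip(expert_probs, student_probs)
    )


def reverse_kl(expert_probs, student_probs, eps=1e-12):
    """Reverse KL: sum_a p_S(a) * log(p_S(a) / p_E(a)).

    The expectation is taken under the student, so actions the student
    never samples contribute nothing to the loss.
    """
    return sum(
        ps * (math.log(ps + eps) - math.log(pe + eps))
        for pe, ps in zip(expert_probs, student_probs)
        if ps > 0
    )


def dagger_style_loss(visited_states, expert_fn, student_fn):
    """Average forward cross-entropy over states the student itself visited.

    Assumption: `expert_fn` and `student_fn` map a state to a list of action
    probabilities. The expert is queried only at these states, which mirrors
    the "local access" setting described in the abstract.
    """
    losses = [
        forward_cross_entropy(expert_fn(s), student_fn(s)) for s in visited_states
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical toy data: three student-visited states, three actions.
    states = ["s0", "s1", "s2"]
    expert = lambda s: [0.7, 0.2, 0.1]
    student = lambda s: [1 / 3, 1 / 3, 1 / 3]
    print("forward CE:", dagger_style_loss(states, expert, student))
    print("reverse KL:", reverse_kl(expert("s0"), student("s0")))
```

The design point the sketch illustrates is only where the expectation is taken: forward cross-entropy weights each action by the expert's probability, while reverse KL weights it by the student's own probability. The monotonic improvement and regret guarantees stated in the abstract are properties of the paper's analysis and are not demonstrated by this toy code.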