ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Abstract

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally, regardless of how informative they are. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Further analysis suggests that models trained only on correct SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full rollouts that contain the final answer, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. ReNIO uses the student-to-teacher probability ratio to identify pivotal tokens that lead to wrong reasoning traces and aggregates their information into a normalized sample weight, so that likely negative trajectories receive larger weights without the correctness of the final answer ever being observed. Because ReNIO uses only prefix-conditioned token probabilities, it preserves OPD's prefix-training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.
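
The abstract does not state the exact scoring or normalization rule, so the following Python sketch is only an illustration of the general idea, not the ReNIO implementation. It assumes that each sampled prefix comes with per-token log-probabilities from the student and the teacher, that "pivotal" tokens are the ones with the largest student-to-teacher log-ratio, and that sample weights are a softmax over per-sample scores rescaled to average 1. The function names, the top_k parameter, and the temperature parameter are all hypothetical.

```python
import math


def sample_score(student_logps, teacher_logps, top_k=2):
    """Score one sampled prefix (assumed rule, not taken from the paper).

    The score is the mean of the top_k largest student-to-teacher
    log-probability ratios over the tokens of the prefix.
    """
    if not student_logps or len(student_logps) != len(teacher_logps):
        raise ValueError("need two non-empty, equal-length lists")
    # log(p_student / p_teacher) for every sampled token
    log_ratios = [s - t for s, t in zip(student_logps, teacher_logps)]
    # Assumption: the tokens with the largest ratio are the "pivotal" ones
    pivotal = sorted(log_ratios, reverse=True)[:top_k]
    return sum(pivotal) / len(pivotal)


def normalized_weights(scores, temperature=1.0):
    """Turn per-sample scores into weights that average to 1 over the batch."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


# Toy batch: the teacher agrees with prefix A but strongly disagrees with
# two tokens of prefix B, so B is treated as a likely negative trajectory.
student_a, teacher_a = [-0.1, -0.2, -0.1], [-0.1, -0.3, -0.2]
student_b, teacher_b = [-0.1, -0.2, -0.1], [-0.1, -3.0, -2.0]

scores = [
    sample_score(student_a, teacher_a),
    sample_score(student_b, teacher_b),
]
weights = normalized_weights(scores)
```

In this toy batch the second prefix receives the larger weight, and no final answer is inspected at any point: only token probabilities conditioned on the prefix are used, which is the property the abstract emphasizes.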