ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Abstract

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally, regardless of how informative they are. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Further analysis suggests that models trained only on correct SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. Using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens that lead to wrong reasoning traces and aggregates their information into a normalized sample weight, so that likely negative trajectories inherently receive larger weights without the correctness of the final answer ever being observed. Because ReNIO uses only prefix-conditioned token probabilities, it preserves OPD's prefix-training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.
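
The reweighting idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration only: the function names, the use of the log ratio log p_student − log p_teacher, the top-k rule for picking pivotal tokens, and the softmax normalization across the batch are hypothetical choices, not the method as implemented in the ReNIO repository. The sketch only shows how prefix-conditioned token probabilities alone could yield a normalized per-sample weight.

```python
import math
from typing import List, Sequence


def sample_scores(
    student_logps: Sequence[Sequence[float]],
    teacher_logps: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[float]:
    """Score each sample by its largest student-to-teacher log ratios.

    Assumption: tokens where the student is far more confident than the
    teacher are treated as "pivotal"; the top_k rule is hypothetical.
    """
    scores = []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        # Per-token log ratio: log p_student(token) - log p_teacher(token).
        ratios = sorted((s - t for s, t in zip(s_lp, t_lp)), reverse=True)
        pivotal = ratios[:top_k]
        # An empty sequence carries no signal; give it a neutral score.
        scores.append(sum(pivotal) / len(pivotal) if pivotal else 0.0)
    return scores


def normalized_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over the batch, rescaled so the weights average to 1.

    Assumption: this normalization is one plausible choice, not the paper's.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def weighted_loss(per_sample_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Batch loss with each sample's distillation loss scaled by its weight."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


# Toy batch: sample 0 agrees with the teacher, sample 1 contains tokens the
# student likes but the teacher finds unlikely (a likely negative trajectory).
student = [[-0.1, -0.2, -0.1], [-0.1, -0.3, -0.2]]
teacher = [[-0.1, -0.3, -0.1], [-0.2, -3.0, -2.5]]
weights = normalized_weights(sample_scores(student, teacher))
```

In this toy batch the second sample receives the larger weight, and no final-answer check is involved at any point: the weights depend only on token probabilities conditioned on the generated prefix.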