ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Abstract

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally, regardless of how informative they are. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Further analysis suggests that models trained only on correct SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. Using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens that lead to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing final-answer correctness. Because ReNIO uses only prefix-conditioned token probabilities, it preserves OPD's prefix-training advantage over full-rollout reinforcement learning. Across mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.
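
The abstract does not state the exact weighting formula, so the following is only a minimal illustrative sketch of the general idea: per-token student-to-teacher log-probability ratios are computed on the student's sampled tokens, the largest ones are treated as "pivotal", and their average is turned into a normalized per-sample weight. The function name `sample_weights`, the `top_k` selection rule, the `temperature` parameter, and the softmax normalization are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def sample_weights(student_logps, teacher_logps, top_k=2, temperature=1.0):
    """Hypothetical per-sample weights from student-to-teacher log-ratios.

    student_logps / teacher_logps: one list per sample, holding the
    log-probability each model assigns to the student's sampled token
    given its prefix. No final-answer correctness label is used.
    """
    scores = []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        if not s_lp or len(s_lp) != len(t_lp):
            raise ValueError("each sample needs aligned, non-empty log-probs")
        # log(p_student / p_teacher) per token: large values mark tokens the
        # student chose confidently but the teacher considers unlikely.
        log_ratios = [s - t for s, t in zip(s_lp, t_lp)]
        # Assumption: the top_k largest ratios stand in for "pivotal" tokens.
        pivotal = sorted(log_ratios, reverse=True)[:top_k]
        scores.append(sum(pivotal) / len(pivotal))

    # Assumption: softmax over the batch, rescaled so weights average to 1.
    max_score = max(scores)
    exps = [math.exp((x - max_score) / temperature) for x in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


# Toy batch: the teacher agrees with sample 0 but strongly disagrees with
# two tokens of sample 1, so sample 1 should receive the larger weight.
student = [[-0.1, -0.2, -0.1], [-0.1, -0.2, -0.1]]
teacher = [[-0.2, -0.3, -0.1], [-0.2, -3.0, -2.5]]
weights = sample_weights(student, teacher)
```

Under these assumptions, the weights would multiply each sample's distillation loss, so trajectories on which the teacher disagrees sharply with the student contribute more to the update.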