Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation
June 17, 2026
Authors: Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang
cs.AI
Abstract
Self-distillation improves reasoning in large language models by using the model's own rollouts as the training signal, typically through implicit logit-level alignment that minimizes the KL divergence to a privileged target distribution. However, because this supervision comes from uncontrolled sampling, it neither diagnoses the model's specific errors nor offers corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts for the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections: new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
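The trajectory construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Rollout` type, the helper names, the mixed-outcome rule for "capability boundary", the way the failure index is supplied, and the choice to normalize augmented trajectories against statistics computed from the original group only are all assumptions made for illustration. The abstract does not specify how the failure point is located, how the diagnosis is generated, or the exact form of the decoupled advantage.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class Rollout:
    steps: List[str]   # reasoning steps sampled from the current policy
    correct: bool      # verdict of an answer checker


def pass_rate(group: List[Rollout]) -> float:
    """Fraction of correct rollouts in one sampling group."""
    return sum(r.correct for r in group) / len(group)


def at_capability_boundary(group: List[Rollout]) -> bool:
    """Assumed selection rule: the group mixes correct and incorrect rollouts."""
    return 0.0 < pass_rate(group) < 1.0


def build_micro_reflective(wrong: Rollout, fail_idx: int,
                           diagnosis: str, corrected_steps: List[str]) -> Rollout:
    """Keep the learner's own steps through the failing step, then append
    a natural-language diagnosis and the corrected continuation."""
    prefix = wrong.steps[:fail_idx + 1]
    # Assumption: the corrected continuation was verified as correct elsewhere.
    return Rollout(steps=prefix + [diagnosis] + corrected_steps, correct=True)


def decoupled_advantages(group: List[Rollout], augmented: List[Rollout],
                         eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Group-normalized advantages whose mean and std come from the original
    rollouts only, so augmented trajectories never shift the baseline
    (one hypothetical reading of 'decoupled')."""
    rewards = [float(r.correct) for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    original = [(x - mu) / (sigma + eps) for x in rewards]
    extra = [(float(a.correct) - mu) / (sigma + eps) for a in augmented]
    return original, extra


# Toy group: one incorrect and one correct rollout for the same query.
group = [
    Rollout(["set x = 3", "x^2 = 6", "answer 6"], correct=False),
    Rollout(["set x = 3", "x^2 = 9", "answer 9"], correct=True),
]
if at_capability_boundary(group):
    fixed = build_micro_reflective(
        group[0], fail_idx=1,
        diagnosis="Wait: 3 squared is 9, not 6.",
        corrected_steps=["x^2 = 9", "answer 9"],
    )
    orig_adv, aug_adv = decoupled_advantages(group, [fixed])
```

In this toy case the augmented trajectory keeps the faulty step `x^2 = 6` in its prefix, so the training example stays anchored in the learner's own output, while the baseline used for the original rollouts is unaffected by the added trajectory.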