Learning from Its Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Abstract

Self-distillation improves reasoning in large language models by using the model's own rollouts as the training signal, typically through implicit logit-level alignment that minimizes the KL divergence to a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it offers neither diagnostic insight into the model's specific errors nor corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts for the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections: new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories into training, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
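
The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical: `Rollout`, `is_boundary_query`, `build_reflective_trajectory`, `decoupled_advantages`, the 0.2/0.8 pass-rate thresholds, and the fixed advantage of 1.0 for augmented trajectories are assumptions made for illustration. The abstract does not specify how the failure point is located, how the diagnosis is generated, or what form the decoupling takes, so the sketch takes the failure index and the diagnosis text as inputs and adopts one possible reading of "decoupled": group statistics are computed over the sampled rollouts only, so the constructed trajectories do not shift the baseline.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    text: str      # full reasoning trace sampled from the policy
    correct: bool  # outcome of the answer verifier


def is_boundary_query(group, low=0.2, high=0.8):
    """Hypothetical difficulty filter: keep queries whose pass rate is mixed.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    rate = mean(1.0 if r.correct else 0.0 for r in group)
    return low <= rate <= high


def build_reflective_trajectory(wrong, failure_index, diagnosis, corrected):
    """Keep the erroneous prefix, then append a diagnosis and a correction.

    `failure_index`, `diagnosis`, and `corrected` are assumed to come from
    elsewhere (for example, from the model prompted with a correct reference).
    """
    prefix = wrong.text[:failure_index]
    return f"{prefix}\n{diagnosis}\n{corrected}"


def decoupled_advantages(group, n_augmented, augmented_advantage=1.0):
    """One possible reading of decoupled advantage estimation (an assumption).

    Sampled rollouts get GRPO-style group-normalized advantages computed from
    the sampled rewards only; augmented trajectories receive a separately
    specified advantage and never enter the group mean or standard deviation.
    """
    rewards = [1.0 if r.correct else 0.0 for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    sampled = [(x - mu) / (sigma + 1e-6) for x in rewards]
    augmented = [augmented_advantage] * n_augmented
    return sampled, augmented
```

Under these assumptions, a query with one correct rollout out of four passes the boundary filter, and an incorrect rollout from that group can be truncated at its failure point and extended with the diagnosis and corrected reasoning. Keeping the augmented trajectories out of the group statistics is the design choice that corresponds, in this reading, to preventing gradient contamination.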