Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Abstract

Self-distillation improves reasoning in large language models by using the model's own rollouts as a training signal, typically through implicit logit-level alignment that minimizes the KL divergence to a privileged target distribution. However, because this supervision is generated by uncontrolled sampling, it offers neither diagnostic insight into the model's specific errors nor corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts for the same query. TAPO leverages this contrastive structure to construct micro-reflective corrections: new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO consistently improves over GRPO under the same number of training steps. Further analysis shows that TAPO strengthens both first-pass reasoning and the effectiveness of error correction.
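The abstract does not specify how TAPO implements its three components, so the following Python sketch is only one possible reading, not the authors' implementation. All names (`Rollout`, `is_boundary_query`, `build_micro_reflective`, `decoupled_advantages`), the pass-rate thresholds, and the step-level splicing are hypothetical. In particular, the sketch assumes that the failing step index and the diagnosis text are supplied externally, that the corrected reasoning can be approximated by the remaining steps of the correct reference, and that "decoupled" means the constructed trajectories are excluded from the group baseline statistics.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class Rollout:
    steps: List[str]  # reasoning steps sampled from the policy
    correct: bool     # verifier outcome for the final answer


def is_boundary_query(group: List[Rollout], low: float = 0.2, high: float = 0.8) -> bool:
    """Hypothetical difficulty filter: keep queries whose sampling group
    has a mixed pass rate (thresholds are assumptions, not from the paper)."""
    if not group:
        return False
    pass_rate = sum(r.correct for r in group) / len(group)
    return low <= pass_rate <= high


def build_micro_reflective(wrong: Rollout, reference: Rollout,
                           fail_idx: int, diagnosis: str) -> List[str]:
    """Keep the erroneous prefix through the failing step, then append a
    natural-language diagnosis and corrected reasoning. As a crude stand-in,
    the corrected reasoning is the reference's steps from fail_idx onward."""
    prefix = wrong.steps[: fail_idx + 1]
    corrected = reference.steps[fail_idx:]
    return prefix + ["Diagnosis: " + diagnosis] + corrected


def decoupled_advantages(group: List[Rollout], augmented_rewards: List[float],
                         eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Group-normalized advantages in the style of GRPO. The baseline mean and
    standard deviation come from the on-policy rollouts only, so constructed
    trajectories cannot shift the advantages of the sampled rollouts
    (an assumed interpretation of 'decoupled')."""
    rewards = [1.0 if r.correct else 0.0 for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    on_policy = [(x - mu) / (sigma + eps) for x in rewards]
    augmented = [(x - mu) / (sigma + eps) for x in augmented_rewards]
    return on_policy, augmented


if __name__ == "__main__":
    group = [
        Rollout(["x + 2 = 5", "x = 7", "answer: 7"], correct=False),
        Rollout(["x + 2 = 5", "x = 3", "answer: 3"], correct=True),
    ]
    if is_boundary_query(group):
        trajectory = build_micro_reflective(
            group[0], group[1], fail_idx=1,
            diagnosis="the 2 should be subtracted from 5, not added.",
        )
        print(trajectory)
        print(decoupled_advantages(group, augmented_rewards=[1.0]))
```

In this toy setup the constructed trajectory keeps the learner's own prefix, which is the property the abstract credits with preserving the on-policy distribution. A real system would presumably have the model itself generate the diagnosis and the corrected continuation rather than splicing reference steps verbatim.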