Trajectory-Refined Distillation

Abstract

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a structural failure mode common to OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting cannot address. This observation motivates moving beyond token-level loss interventions toward trajectory-level output corrections. We therefore propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while remaining within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd.
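As a rough illustration of what "dense per-token teacher supervision along the student's own rollouts" can look like, the sketch below computes a per-position divergence between student and teacher next-token distributions on a toy rollout. It is not taken from the TRD codebase: the choice of reverse KL, the function names `log_softmax` and `per_token_reverse_kl`, and the toy logits are all assumptions made for illustration, and the abstract does not specify which divergence the method uses.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    """KL(student || teacher) at each position of a student-sampled rollout.

    Assumption: both models are scored on the SAME student-generated prefix,
    so row i of each input is the next-token logits at position i.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


# Toy rollout of two positions over a three-token vocabulary (hypothetical values).
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

token_losses = per_token_reverse_kl(student, teacher)
opd_loss = sum(token_losses) / len(token_losses)  # mean over positions
```

In this toy case the two models agree at the first position and disagree at the second, so the supervision signal is concentrated on the later token. Token-level truncation or reweighting would act on these per-position values, whereas a trajectory-level method such as TRD, as described above, revises the rollout itself before any such loss is computed.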