The Role of Feedback Alignment in Self-Distillation

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
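The per-token argument can be illustrated with a minimal sketch. The abstract does not define the per-token advantage, so this sketch assumes it is the log-ratio between the probability the self-teacher (question plus context) and the student (question only) assign to each token the solver sampled. The function names and all probabilities below are hypothetical toy values chosen for illustration, not results from the paper.

```python
import math


def per_token_advantage(student_logprobs, teacher_logprobs):
    """Assumed definition: log p_teacher(token) - log p_student(token),
    evaluated on the tokens the solver actually sampled."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both sequences must cover the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def touched_tokens(advantages, tol=0.05):
    """Indices of tokens whose advantage magnitude exceeds tol,
    i.e. tokens the training signal would push on."""
    return [i for i, a in enumerate(advantages) if abs(a) > tol]


# Hypothetical 4-token solver trace; token 2 is the faulty reasoning step.
student = [math.log(p) for p in (0.90, 0.80, 0.70, 0.85)]

# Self-teacher conditioned on a step-aligned critique: it agrees with the
# student on correct tokens and disagrees only at the faulty one.
teacher_step_aligned = [math.log(p) for p in (0.90, 0.80, 0.10, 0.85)]

# Self-teacher conditioned on a reference solution: a different derivation
# lowers the probability of the student's phrasing at every token.
teacher_reference = [math.log(p) for p in (0.60, 0.50, 0.10, 0.55)]

adv_step = per_token_advantage(student, teacher_step_aligned)
adv_ref = per_token_advantage(student, teacher_reference)

print("step-aligned critique touches tokens:", touched_tokens(adv_step))
print("reference solution touches tokens:  ", touched_tokens(adv_ref))
```

With these toy numbers, the step-aligned self-teacher produces a nonzero signal only at the faulty token, while the reference-conditioned self-teacher produces a nonzero signal at every position, including the steps that were already correct.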