The Role of Feedback Alignment in Self-Distillation

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
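
To make the distribution-matching idea concrete, the following is a minimal sketch on toy data, not the training code used in this work. It assumes a per-token KL divergence from the self-teacher to the student as the matching objective, and a per-token advantage defined as the teacher's log-probability minus the student's log-probability on the sampled token; the abstract does not specify either choice. All logits, token ids, and function names (`softmax`, `per_token_kl`, `per_token_advantage`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_kl(student_logits, teacher_logits):
    """KL(teacher || student) at each position (assumed objective)."""
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s, p_t = softmax(s_row), softmax(t_row)
        kls.append(sum(t * math.log(t / s) for s, t in zip(p_s, p_t)))
    return kls

def per_token_advantage(student_logits, teacher_logits, token_ids):
    """log p_teacher(token) - log p_student(token) for each sampled token."""
    advantages = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, token_ids):
        advantages.append(
            math.log(softmax(t_row)[tok]) - math.log(softmax(s_row)[tok])
        )
    return advantages

# Toy trace: 3 positions, vocabulary of 3 tokens (all values hypothetical).
# The student sees only the question.
student = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [1.5, 1.4, 0.1]]
sampled = [0, 1, 0]  # tokens the student actually produced

# Self-teacher given a step-aligned critique: agrees with the student on the
# first two (correct) steps and disagrees only at the failing third step.
step_teacher = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.2, 2.5, 0.1]]

# Self-teacher given a reference solution: prefers different tokens everywhere,
# mimicking an alternative derivation with different phrasing.
ref_teacher = [[0.5, 2.0, 0.1], [1.8, 0.3, 0.2], [0.2, 2.5, 0.1]]

step_kl = per_token_kl(student, step_teacher)
ref_kl = per_token_kl(student, ref_teacher)
step_adv = per_token_advantage(student, step_teacher, sampled)
ref_adv = per_token_advantage(student, ref_teacher, sampled)

print("step-aligned KL per token:", step_kl)
print("reference    KL per token:", ref_kl)
print("step-aligned advantage:   ", step_adv)
print("reference    advantage:   ", ref_adv)
```

In this toy setup the step-aligned teacher produces a nonzero signal only at the third position, whereas the reference-conditioned teacher produces a nonzero signal at every position, including the two steps the student already handled correctly.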