The Role of Feedback Alignment in Self-Distillation

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
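
As a minimal sketch of the per-token comparison described above (not the authors' implementation), the following pure-Python example assumes that distribution matching is measured as a per-token KL divergence from the self-teacher to the student, and that a per-token advantage can be approximated as the teacher-minus-student log-probability of each sampled token. The abstract does not specify the exact objective, so both definitions, the function names, and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each token position (assumed objective).

    Both arguments are lists of per-position logit vectors scored on the same
    sampled response. The teacher pass also sees the context (for example a
    critique); the student pass sees only the question.
    """
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_lp = log_softmax(t_row)
        s_lp = log_softmax(s_row)
        kls.append(sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp)))
    return kls


def per_token_advantage(teacher_logits, student_logits, token_ids):
    """Assumed advantage: log p_teacher(y_t) - log p_student(y_t) per sampled token."""
    return [
        log_softmax(t_row)[y] - log_softmax(s_row)[y]
        for t_row, s_row, y in zip(teacher_logits, student_logits, token_ids)
    ]


# Toy response of 3 tokens over a vocabulary of size 3 (hypothetical values).
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
# The self-teacher disagrees with the student only at position 1.
teacher = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
tokens = [0, 1, 2]  # tokens the student actually sampled

kl = per_token_kl(teacher, student)
adv = per_token_advantage(teacher, student, tokens)
print("per-token KL:", kl)
print("per-token advantage:", adv)
```

In this toy case the divergence and the advantage are nonzero only at position 1, which mirrors the behavior the abstract attributes to step-aligned critique: the training signal concentrates on the token where reasoning fails. A context that shifts the teacher's distribution at every position, as the abstract describes for reference solutions, would instead produce nonzero values throughout.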