The Role of Feedback Alignment in Self-Distillation

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
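
To make the per-token picture concrete, the following is a minimal sketch rather than the paper's implementation. The abstract does not define the per-token advantage, so the sketch assumes it is the log-probability the self-teacher assigns to each sampled token minus the log-probability the student assigns. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_signal(student_logits, teacher_logits, sampled_tokens):
    """Assumed per-token signal: log p_teacher(token) - log p_student(token).

    student_logits: logits from the model given only the question.
    teacher_logits: logits from the same model given question plus context.
    sampled_tokens: token ids of the student's own attempt.
    """
    signals = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        signals.append(log_softmax(t_row)[tok] - log_softmax(s_row)[tok])
    return signals


# Toy attempt: 3 positions, vocabulary of 3 tokens (all values are made up).
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
tokens = [0, 1, 2]

# Hypothetical self-teacher that disagrees with the student only at position 1,
# mimicking feedback that flags a single faulty step.
teacher_aligned = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

signals = per_token_signal(student, teacher_aligned, tokens)
print(signals)
```

Under this assumed definition, the signal is zero wherever the self-teacher and student agree and negative at the one position the self-teacher disfavors. A self-teacher whose distribution differs at every position would produce a nonzero signal at every token, which is the contrast the abstract draws between step-aligned critique and reference-solution conditioning.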