Less is More: Early Stopping Rollout for On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by scoring the student's own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm. For later tokens, the student's earlier trajectory serves as context that is off-policy to the teacher, so the teacher's ability to produce a corrective score decays, and the teacher may fall back to the token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to address it: a simple yet effective distillation strategy that restricts rollout generation to the first tokens of the response. We show that ESR surpasses full-rollout OPD across model sizes, model families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. Finally, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals alone.
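
To make the early-stopping idea concrete, the following is a minimal sketch of a distillation loss that only uses the first tokens of a student rollout. It is an illustration under stated assumptions, not the authors' implementation: the per-token reverse KL objective, the function names `reverse_kl` and `truncated_distillation_loss`, and the parameter `max_rollout_tokens` are all hypothetical choices, since the abstract does not specify the loss or the truncation length. In a real training loop the rollout generation itself would stop after `max_rollout_tokens` tokens, which is where the efficiency gain would come from; here the truncation is applied to precomputed toy distributions for simplicity.

```python
import math
from typing import List, Sequence

# One probability distribution over a toy vocabulary (assumed to sum to 1).
Dist = Sequence[float]


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used as the per-token signal; the abstract
    does not name the objective.
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student, teacher)
        if s > 0.0
    )


def truncated_distillation_loss(
    student_dists: List[Dist],
    teacher_dists: List[Dist],
    max_rollout_tokens: int,
) -> float:
    """Mean per-token reverse KL over the first `max_rollout_tokens` positions.

    Positions beyond the cutoff are ignored entirely, mimicking a rollout
    that was stopped early (hypothetical parameter name and cutoff rule).
    """
    if max_rollout_tokens <= 0:
        raise ValueError("max_rollout_tokens must be positive")
    if len(student_dists) != len(teacher_dists):
        raise ValueError("student and teacher must cover the same positions")

    kept = list(zip(student_dists, teacher_dists))[:max_rollout_tokens]
    if not kept:
        return 0.0
    return sum(reverse_kl(s, t) for s, t in kept) / len(kept)


if __name__ == "__main__":
    # Toy rollout with two response positions over a two-token vocabulary.
    student = [[0.5, 0.5], [0.9, 0.1]]
    teacher = [[0.5, 0.5], [0.5, 0.5]]
    print("first token only:", truncated_distillation_loss(student, teacher, 1))
    print("full rollout:    ", truncated_distillation_loss(student, teacher, 2))
```

With `max_rollout_tokens=1` the loss in this toy example depends only on the first response position, whereas the full-rollout variant also averages in the later position, where (per the abstract's argument) the teacher's scores may be less reliable.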