Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Abstract

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning with verifiable rewards (RLVR) relies on binary rewards, which are broadly applicable and powerful but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or from high-quality demonstrations, but such supervision can be costly to collect or unavailable altogether. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training-sample-efficient than RL and requires neither an external teacher nor high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the Reviser into the Generator, using the Reviser's token distributions, conditioned on the Generator's response and its reward, as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies reveal two novel characteristics of the proposed algorithm: (a) token-level self-localization, where the Reviser uses the reward to identify the key tokens in the Generator's response that need to be revised, and (b) iterative self-evolution, where improvements in the ability to revise answers are distilled back into generation performance through regular teacher synchronization.
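The self-distillation step described in the abstract can be pictured with a toy calculation. The sketch below is illustrative only and is not the authors' implementation: it assumes a per-token forward KL divergence, KL(Reviser || Generator), as the distillation objective (the abstract does not name the divergence), and all logits, the vocabulary size, and the function names are invented for the example. The positions stand for tokens of a response sampled from the Generator; the Reviser's distributions at the same positions are those of the same model additionally conditioned on that response and its binary reward.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at a single token position (assumed objective)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def self_distillation_loss(student_logits, teacher_logits):
    """Return the per-position KL values and their mean.

    student_logits: Generator logits at each position of its own sampled response.
    teacher_logits: Reviser logits at the same positions (hypothetical values here).
    """
    per_token = [
        token_kl(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return per_token, sum(per_token) / len(per_token)


# Toy example: 3 token positions, vocabulary of 4 tokens (made-up numbers).
student_logits = [
    [2.0, 0.0, 0.0, 0.0],  # position 0
    [2.0, 0.0, 0.0, 0.0],  # position 1: Generator prefers token 0
    [0.0, 1.0, 0.0, 0.0],  # position 2
]
teacher_logits = [
    [2.0, 0.0, 0.0, 0.0],  # position 0: Reviser agrees exactly
    [0.0, 2.0, 0.0, 0.0],  # position 1: Reviser prefers token 1 instead
    [0.0, 1.2, 0.0, 0.0],  # position 2: Reviser agrees, slightly more confident
]

per_token, mean_loss = self_distillation_loss(student_logits, teacher_logits)
for position, value in enumerate(per_token):
    print(f"position {position}: KL = {value:.4f}")
print(f"mean loss = {mean_loss:.4f}")
```

In this toy setup a single binary reward, entering only through the Reviser's conditioning, yields one loss value per token, and the largest value falls on the position where the two roles disagree. That is the sense in which the abstract speaks of dense token-level supervision and of token-level self-localization; whether real models behave this way is what the paper's ablations examine.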
