Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
April 13, 2026
Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
cs.AI
Abstract
Current post-training methods in verifiable settings fall into two categories. Reinforcement learning with verifiable rewards (RLVR) relies on binary rewards; it is broadly applicable and powerful but provides only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or from high-quality demonstrations, but such supervision can be costly to collect or simply unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more sample-efficient during training than RL and requires neither an external teacher nor high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation, using the Reviser's token distributions, conditioned on the Generator's response and its reward, as supervision to distill the Reviser into the Generator. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training-sample budget. Extensive ablation studies reveal two novel properties of the algorithm: (a) token-level self-localization, where the Reviser uses the reward signal to identify the key tokens in the Generator's response that need revision, and (b) iterative self-evolution, where gains in revision ability are distilled back into generation performance through regular teacher synchronization.
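The core mechanism, distilling the Reviser's token distributions into the Generator, can be illustrated with a minimal sketch. All names and the toy distributions below are illustrative assumptions, not the authors' code; in the real method both distributions come from the same model under different conditioning (the Reviser additionally sees the Generator's response and its binary reward), and the KL direction shown here is one plausible choice.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sd_zero_loss(generator_dists, reviser_dists):
    """Dense per-token supervision: at each position of the Generator's
    response, pull the Generator's next-token distribution toward the
    Reviser's, which was produced conditioned on (prompt, response, reward).
    Returns the mean per-token KL(reviser || generator)."""
    per_token = [kl_divergence(r, g)
                 for g, r in zip(generator_dists, reviser_dists)]
    return sum(per_token) / len(per_token)

# Toy example: a 2-token response over a 3-word vocabulary.
gen = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
# Seeing reward = 0 (wrong answer), the Reviser keeps position 0 unchanged
# but shifts mass at position 1 -- "self-localizing" the token to revise.
rev = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

loss = sd_zero_loss(gen, rev)
print(f"mean per-token KL: {loss:.4f}")  # nonzero only at the revised token
```

Note the dense signal: the loss is zero at positions the Reviser leaves untouched and concentrates on the tokens it changes, which is exactly how a single binary reward becomes token-level supervision.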