SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
April 12, 2026
Authors: Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai
cs.AI
Abstract
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
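The dual-path weighting described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the routing by correctness, the direction of each perplexity weighting (incorrect rollouts weighted toward low teacher perplexity, correct rollouts toward high student perplexity), and the group-level renormalization follow the abstract, but the dictionary keys, function names, and the exact normalization scheme are assumptions.

```python
import math

def group_normalize(weights):
    """Rescale weights within a group so they sum to the group size,
    keeping the average per-trajectory loss scale unchanged (one of
    several plausible group-level calibrations)."""
    total = sum(weights)
    if total == 0:
        return [1.0] * len(weights)
    n = len(weights)
    return [w * n / total for w in weights]

def perplexity(token_logprobs):
    """Trajectory perplexity from per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def scope_weights(rollouts):
    """Route rollouts into two supervision paths and compute weights.

    Each rollout is a dict (assumed format) with:
      'correct'       : bool, outcome-level correctness of the rollout
      'teacher_logps' : per-token log-probs under the teacher model
      'student_logps' : per-token log-probs under the student model

    Incorrect rollouts feed the KL-distillation path, weighted toward
    LOW teacher perplexity (confident, reliable teacher guidance).
    Correct rollouts feed the MLE path, weighted toward HIGH student
    perplexity (low-confidence samples at the capability boundary).
    """
    wrong = [r for r in rollouts if not r['correct']]
    right = [r for r in rollouts if r['correct']]

    # Path 1: teacher-PPL-weighted KL — inverse teacher perplexity.
    kl_w = group_normalize(
        [1.0 / perplexity(r['teacher_logps']) for r in wrong])
    # Path 2: student-PPL-weighted MLE — direct student perplexity.
    mle_w = group_normalize(
        [perplexity(r['student_logps']) for r in right])

    return list(zip(wrong, kl_w)), list(zip(right, mle_w))
```

In this sketch the two weight vectors would multiply the per-trajectory KL and MLE losses respectively; the group normalization keeps the weighting relative within a prompt group rather than comparing perplexities across prompts of different difficulty.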