

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

April 2, 2026
Authors: Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
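The core routing-and-weighting idea described above can be sketched in a few lines. This is a minimal illustration based only on the abstract, not the paper's implementation: the function names, the reward threshold, and the `exp(-H/tau)` weighting form are all assumptions; the paper's exact entropy-aware weighting may differ.

```python
import math

def route_batch(rollouts):
    """Split on-policy rollouts by their verifiable reward (hypothetical split):
    correct samples go to a GRPO-style reward-aligned update, failed samples
    to SDPO-style targeted self-distillation."""
    grpo_samples = [r for r in rollouts if r["reward"] > 0]
    sdpo_samples = [r for r in rollouts if r["reward"] <= 0]
    return grpo_samples, sdpo_samples

def entropy_weight(token_probs, tau=1.0):
    """Entropy-aware weight for a self-teacher distillation target.
    High-entropy (unreliable) targets are suppressed, confident ones
    emphasized; exp(-H / tau) is an illustrative choice of decay."""
    entropy = -sum(p * math.log(p) for p in token_probs if p > 0)
    return math.exp(-entropy / tau)

rollouts = [{"reward": 1.0, "id": 0}, {"reward": 0.0, "id": 1}]
correct, failed = route_batch(rollouts)
confident = entropy_weight([0.97, 0.01, 0.01, 0.01])  # near-deterministic target
uncertain = entropy_weight([0.25, 0.25, 0.25, 0.25])  # uniform, high entropy
```

A confident target distribution receives a weight close to 1, while a uniform one is strongly down-weighted, matching the abstract's goal of suppressing unreliable distillation signals late in training.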