Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

May 12, 2026
作者: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
cs.AI

Abstract

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style (on-policy distillation) dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid using scarce labeled data on the least-prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
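
The reward-density contrast and the two-stage bridge described above are concrete enough to sketch. Below is a minimal, hypothetical PyTorch sketch of the three objectives the abstract names: GRPO-style group-normalized sequence rewards (sparse), a forward-KL warmup on teacher rollouts, and OPD-style dense token-level supervision on student rollouts (written here as a reverse KL, one common choice for on-policy distillation). All function names, tensor shapes, and the reverse-KL choice are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: function names, tensor shapes, and the reverse-KL
# choice for OPD are assumptions, not the paper's released implementation.


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sparse regime: one verifier reward per rollout, normalized within the
    group of rollouts drawn for the same prompt (GRPO-style advantages)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def forward_kl_warmup(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """Bridge stage 1: KL(teacher || student), evaluated on TEACHER rollouts.
    Logits are (num_tokens, vocab); the loss is a mean over tokens."""
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(s, t, log_target=True, reduction="batchmean")


def opd_dense_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Bridge stage 2: KL(student || teacher) on STUDENT rollouts, i.e. a
    dense token-level teacher signal on sequences the student itself samples."""
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(t, s, log_target=True, reduction="batchmean")


# Example: a group of 4 rollouts for one prompt with binary verifier reward.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))  # group-normalized
```

Under the abstract's allocation rule, the scarce verifier rewards would be spent upstream on the teacher's sparse-RL stage; the two KL losses would then run on the student in the order warmup-then-OPD; and any student-side pass through something like `grpo_advantages` would come only after the bridge.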