Filter, then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Abstract

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting to the tokens within the retained trajectories to emphasize the informative ones. Compared with hard token selection, the soft-weighting mechanism mitigates information loss and improves optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 points on AIME 2024 in the strong-to-weak setting and +18.81 points on Miner in the multi-teacher setting). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
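The two-stage idea described above (trajectory filtering followed by soft token reweighting) can be illustrated with a minimal sketch. The abstract does not specify the filtering criterion or the weighting function, so everything concrete here is an assumption made for illustration: the per-trajectory `quality` score and the `min_quality` threshold are hypothetical, the token weights are taken to be a temperature-scaled softmax over per-token KL values, and the names `token_kl`, `soft_weights`, and `fire_opd_loss` are not taken from the released code.

```python
import math


def token_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position.

    Assumes teacher_probs is strictly positive wherever student_probs is.
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


def soft_weights(kls, temperature=1.0):
    """Assumed weighting: softmax over per-token KL, rescaled to average 1.

    Every token keeps a non-zero weight, unlike hard token selection.
    """
    peak = max(kls)  # subtract the max for numerical stability
    exps = [math.exp((k - peak) / temperature) for k in kls]
    total = sum(exps)
    return [len(kls) * e / total for e in exps]


def fire_opd_loss(trajectories, min_quality=0.5, temperature=1.0):
    """Filter trajectories, then average the softly reweighted token KL."""
    # Stage 1 (assumed criterion): drop rollouts below a quality threshold.
    kept = [t for t in trajectories if t["quality"] >= min_quality]
    if not kept:
        return 0.0

    per_trajectory = []
    for traj in kept:
        kls = [token_kl(s, t) for s, t in zip(traj["student"], traj["teacher"])]
        # Stage 2: soft reweighting. In a real training loop these weights
        # would typically be treated as constants (no gradient through them).
        weights = soft_weights(kls, temperature)
        weighted = sum(w * k for w, k in zip(weights, kls))
        per_trajectory.append(weighted / len(kls))
    return sum(per_trajectory) / len(per_trajectory)


if __name__ == "__main__":
    good = {
        "quality": 0.9,
        "student": [[0.7, 0.3], [0.5, 0.5]],
        "teacher": [[0.6, 0.4], [0.1, 0.9]],
    }
    bad = {
        "quality": 0.1,
        "student": [[0.5, 0.5]],
        "teacher": [[0.9, 0.1]],
    }
    print(fire_opd_loss([good, bad]))
```

In this sketch the low-quality rollout contributes nothing to the loss, while within the retained rollout the token with the larger student-teacher divergence receives a larger weight without the other token being discarded. Lowering `temperature` pushes the weighting toward hard selection; raising it approaches uniform full-trace KL.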