Joint Training of Reinforcement Learning and Multi-Token Prediction via Optimal Coefficient Calibration

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is a natural step, yet current RL practice often detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective and show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals that, although it is the intuitive choice, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible computational cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.
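
To make the two-term decomposition concrete, the following is a generic illustration based on a standard smoothness bound, not the derivation used in this work. All symbols are assumptions introduced here: J is the RL objective (to be maximized) and is assumed L-smooth, g is its gradient with respect to the shared parameters, h is the update direction contributed by the MTP loss, \eta is the learning rate, and \lambda is the MTP coefficient.

```latex
% Illustrative sketch only; J, g, h, \eta, \lambda, L are assumed notation.
% Joint update on the shared parameters:
%   \theta' = \theta + \eta (g + \lambda h),  with  g = \nabla J(\theta).
\begin{align*}
J(\theta')
  &\ge J(\theta) + \eta \langle g,\, g + \lambda h \rangle
      - \frac{L \eta^{2}}{2} \lVert g + \lambda h \rVert^{2} \\
  &= J(\theta) + \eta \Bigl(1 - \frac{L\eta}{2}\Bigr) \lVert g \rVert^{2}
      + \underbrace{\lambda\, \eta (1 - L\eta) \langle g, h \rangle}_{\text{first-order correlation}}
      - \underbrace{\lambda^{2}\, \frac{L \eta^{2}}{2} \lVert h \rVert^{2}}_{\text{second-order penalty}} .
\end{align*}
% Maximizing the bound over \lambda gives
\begin{equation*}
\lambda^{\star} = \frac{(1 - L\eta)\, \langle g, h \rangle}{L \eta\, \lVert h \rVert^{2}} .
\end{equation*}
```

Under these assumptions, \lambda = 0 recovers the detach regime, and a fixed \lambda > 0 lowers the bound once the correlation \langle g, h \rangle shrinks while \lVert h \rVert^{2} does not, which matches the qualitative failure mode described above. The log-probability proxy that OCC uses to track the coefficient online is not modeled in this sketch.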