Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration
May 27, 2026
Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is natural, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective and show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals that, although it aligns with intuition, performance still degrades because the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, improving joint MTP-RL training performance.
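
The two-term decomposition described in the abstract can be illustrated with a generic second-order expansion. This is only a sketch under stated assumptions, not the paper's own derivation or notation: $J$ is the RL objective, $\theta_{\mathrm{RL}}$ the parameters after the RL-only step, $d$ the update direction contributed by the MTP loss on the shared parameters, $\lambda$ the MTP coefficient, $\eta$ the learning rate, and $\kappa > 0$ an assumed curvature constant along $d$. All of these symbols are hypothetical.

```latex
% Effect of adding the MTP perturbation \eta\lambda d to the RL-only update
% (second-order Taylor expansion; symbols are illustrative assumptions).
\[
\Delta J(\lambda)
  = J(\theta_{\mathrm{RL}} + \eta\lambda d) - J(\theta_{\mathrm{RL}})
  \approx \underbrace{\eta\lambda\, c}_{\text{first-order correlation}}
  \;-\; \underbrace{\tfrac{1}{2}\,\eta^{2}\lambda^{2}\,\kappa\,\lVert d\rVert^{2}}_{\text{second-order penalty}},
\qquad
c = \langle \nabla J(\theta_{\mathrm{RL}}),\, d \rangle .
\]
% Here we assume d^{\top}\nabla^{2}J(\theta_{\mathrm{RL}})\,d = -\kappa\lVert d\rVert^{2}.
% Setting d\Delta J / d\lambda = 0 gives the maximizing coefficient and gain:
\[
\lambda^{\star} = \frac{c}{\eta\,\kappa\,\lVert d\rVert^{2}},
\qquad
\Delta J(\lambda^{\star}) = \frac{c^{2}}{2\,\kappa\,\lVert d\rVert^{2}} .
\]
```

Under these assumptions, a fixed $\lambda$ yields a net loss once the correlation $c$ decays toward zero, because only the quadratic penalty remains, while a coefficient that follows $\lambda^{\star}$ shrinks with $c$ and keeps the expected effect non-negative. This is consistent with the abstract's motivation for tracking the optimal coefficient online.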