Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has become the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is a natural idea, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis shows that the policy loss, although intuitively appealing, still degrades performance: its correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible computational cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, so that joint MTP-RL training improves performance rather than degrading it.
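
The abstract does not state the exact form of the decomposition, so the following is only a toy illustration of the argument, not the paper's method. It assumes the per-step effect of an MTP coefficient `lam` can be modeled as a scalar quadratic, `lam * corr - 0.5 * lam**2 * penalty`, where `corr` stands in for the first-order correlation term and `penalty` for the second-order perturbation penalty. All names and numbers here are hypothetical, and the log-probability proxy used by OCC is not modeled.

```python
def mtp_gain(lam: float, corr: float, penalty: float) -> float:
    """Toy per-step change in the RL objective for MTP coefficient `lam`.

    Assumed form (not taken from the paper): a linear benefit from the
    correlation term minus a quadratic cost from the perturbation penalty.
    """
    return lam * corr - 0.5 * lam ** 2 * penalty


def optimal_coefficient(corr: float, penalty: float) -> float:
    """Maximizer of the toy quadratic above, clipped at zero.

    Setting the derivative corr - lam * penalty to zero gives corr / penalty.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive in this toy model")
    return max(0.0, corr / penalty)


if __name__ == "__main__":
    penalty = 1.0      # hypothetical constant penalty that persists
    fixed_lam = 0.5    # hypothetical fixed MTP loss weight
    # Hypothetical decaying correlation over the course of training.
    for corr in (1.0, 0.5, 0.1):
        lam_star = optimal_coefficient(corr, penalty)
        print(
            f"corr={corr:.1f}  fixed gain={mtp_gain(fixed_lam, corr, penalty):+.3f}  "
            f"calibrated lam={lam_star:.2f}  "
            f"calibrated gain={mtp_gain(lam_star, corr, penalty):+.3f}"
        )
```

In this toy model a fixed coefficient turns harmful once the correlation has decayed far enough, while a coefficient that follows `corr / penalty` shrinks toward zero and never yields a negative gain. This mirrors, under the stated assumptions, why an adaptive coefficient could match or exceed a detach baseline.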