Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

May 27, 2026
Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
cs.AI

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is a natural step, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the Policy loss reveals that, although it is the intuitive choice, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the Detach baseline, improving the performance of joint MTP-RL training.
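
The two-term decomposition described in the abstract can be illustrated with a generic second-order expansion. This is only a sketch under stated assumptions, not the paper's actual derivation: the symbols J (RL objective), h (the update direction that the MTP loss induces on shared parameters), eta (step size), lambda (MTP coefficient), a, and b are all introduced here for illustration, and the curvature along h is assumed to be negative so that b > 0.

```latex
% Illustrative sketch only; all symbols are assumptions, not the paper's notation.
% Perturb the parameters \theta by an MTP-induced step \eta\lambda h and expand J:
\[
  \Delta J(\lambda)
  \;=\; J(\theta + \eta\lambda h) - J(\theta)
  \;\approx\;
  \underbrace{\eta\lambda\, a}_{\text{first-order correlation}}
  \;-\;
  \underbrace{\tfrac{1}{2}\,\eta^{2}\lambda^{2}\, b}_{\text{second-order penalty}},
\]
\[
  a = \langle \nabla J(\theta),\, h \rangle,
  \qquad
  b = -\,h^{\top} \nabla^{2} J(\theta)\, h \;>\; 0 .
\]
% Setting the derivative with respect to \lambda to zero gives the maximizer:
\[
  \frac{d\,\Delta J}{d\lambda} = \eta a - \eta^{2}\lambda b = 0
  \quad\Longrightarrow\quad
  \lambda^{*} = \frac{a}{\eta b},
  \qquad
  \Delta J(\lambda^{*}) = \frac{a^{2}}{2b}.
\]
```

Under this toy model, a fixed coefficient becomes harmful once the correlation a decays toward zero, because the quadratic penalty remains while the linear gain vanishes. This matches the qualitative failure the abstract attributes to the Policy loss, and it suggests why a coefficient that tracks the optimum online could avoid it.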