Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) has become a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals why it degrades performance even though it aligns with intuition: the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, improving joint MTP-RL training performance.
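
To make the two-term decomposition concrete, the following is a generic second-order sketch and not the paper's own derivation. All symbols are assumptions introduced here for illustration: $J$ is the RL objective, $\theta$ the shared parameters, $g = \nabla J(\theta)$ the RL gradient, $d$ the update direction contributed by the MTP loss, $H = \nabla^2 J(\theta)$, $\eta$ the learning rate, and $\lambda$ the MTP coefficient. The sketch also assumes $d^\top H d < 0$, so that the quadratic term acts as a penalty.

```latex
% Assumed joint update: RL direction g plus lambda times the MTP direction d.
\theta'_\lambda = \theta + \eta\,(g + \lambda d)

% Change in J relative to the detached update (lambda = 0), to second order:
\Delta(\lambda) \;=\; J(\theta'_\lambda) - J(\theta'_0)
  \;\approx\; \lambda\,\eta\,\bigl(g^\top d + \eta\, g^\top H d\bigr)
  \;+\; \frac{\lambda^2 \eta^2}{2}\, d^\top H d

% Shorthand (both hypothetical names):
%   a = eta (g^T d + eta g^T H d)   first-order correlation term
%   b = -eta^2 d^T H d > 0          second-order perturbation penalty
\Delta(\lambda) \;\approx\; \lambda\, a \;-\; \frac{\lambda^2}{2}\, b

% Maximizing over lambda gives the coefficient and the resulting gain:
\lambda^\star = \frac{a}{b}, \qquad \Delta(\lambda^\star) = \frac{a^2}{2b}

% A fixed lambda > 0 hurts the RL objective once the correlation is too small:
\Delta(\lambda) < 0 \quad\Longleftrightarrow\quad a < \frac{\lambda\, b}{2}
```

Under these assumptions, Detach corresponds to $\lambda = 0$. A fixed positive coefficient helps only while $a > \lambda b / 2$, so if $a$ decays during training while $b$ stays bounded away from zero, the joint update eventually underperforms the detached one. This is consistent with the abstract's account of why the policy loss degrades, and it suggests why a coefficient that tracks $a/b$ online could avoid the degradation. How OCC estimates this ratio through its log-probability proxy is not specified in the abstract and is not sketched here.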