Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is a natural approach, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals that, although it is the intuitively aligned choice, performance still degrades: the correlation term decays over training while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the Detach baseline, improving joint MTP-RL training performance.
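
To make the two-term decomposition concrete, the following minimal sketch models the per-step effect as a quadratic in the MTP loss coefficient. Everything in it is an assumption made for illustration: the abstract does not give the exact form of either term, so the function names, the `correlation` and `penalty` inputs, and the clipping at zero are hypothetical, and the log-probability proxy that OCC uses to estimate these quantities online is not modeled here.

```python
def per_step_effect(coef, correlation, penalty):
    """Assumed quadratic model of the change in the RL objective.

    coef * correlation   : first-order correlation term
    coef**2 * penalty    : second-order perturbation penalty
    """
    return coef * correlation - coef ** 2 * penalty


def optimal_coefficient(correlation, penalty):
    """Maximizer of the quadratic model, clipped at zero.

    A coefficient of zero means the MTP loss does not affect the
    shared parameters, which is how this sketch represents Detach.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    return max(0.0, correlation / (2.0 * penalty))


if __name__ == "__main__":
    fixed_coef = 0.3
    penalty = 1.0
    # Hypothetical correlation values that decay as training proceeds.
    for correlation in (0.4, 0.1, -0.05):
        best = optimal_coefficient(correlation, penalty)
        print(
            f"correlation={correlation:+.2f}  "
            f"fixed={per_step_effect(fixed_coef, correlation, penalty):+.4f}  "
            f"calibrated={per_step_effect(best, correlation, penalty):+.4f}"
        )
```

Under this model a fixed coefficient turns harmful once the correlation decays, because the quadratic penalty does not shrink with it. The calibrated coefficient falls back toward zero instead, so its modeled effect is never worse than that of Detach.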