Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) is a widely adopted module in pretraining. Combining the two is a natural step, yet current RL practice detaches MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective decomposes into two terms: a first-order correlation term and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss shows that, although this regime aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.
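The decomposition described above can be illustrated with a generic smoothness argument. This is a minimal sketch under stated assumptions, not necessarily the derivation used in the paper. Every symbol is introduced here for illustration only: $J$ is the RL objective, assumed $L$-smooth; $\tilde\theta$ is the set of shared parameters after the RL-only update; $d$ is the MTP-induced update direction on those parameters; $\eta$ is the step size; and $\lambda \ge 0$ is the MTP coefficient.

```latex
% Assumption: J is L-smooth, so adding the MTP perturbation \eta\lambda d
% changes the objective by at least a linear term minus a quadratic term.
\begin{align}
J(\tilde\theta + \eta\lambda d) - J(\tilde\theta)
  &\;\ge\; \lambda \underbrace{\eta\, \nabla J(\tilde\theta)^{\top} d}_{c\ \text{(first-order correlation)}}
   \;-\; \lambda^{2} \underbrace{\tfrac{L}{2}\, \eta^{2}\, \lVert d \rVert^{2}}_{p\ \text{(second-order penalty)}} \\[4pt]
% Maximizing the lower bound \lambda c - \lambda^2 p over \lambda \ge 0:
\lambda^{\star}
  &\;=\; \max\!\Bigl(0,\ \frac{c}{2p}\Bigr)
   \;=\; \max\!\Bigl(0,\ \frac{\nabla J(\tilde\theta)^{\top} d}{L\, \eta\, \lVert d \rVert^{2}}\Bigr),
\qquad
\lambda^{\star} c - (\lambda^{\star})^{2} p \;=\; \frac{c^{2}}{4p} \quad \text{when } c > 0 .
\end{align}
```

Under these assumptions, $\lambda = 0$ corresponds to the Detach regime and leaves the bound at zero. Any fixed $\lambda > 0$ still pays the penalty $\lambda^{2} p$ once the correlation $c$ has decayed toward zero, which matches the failure pattern the abstract attributes to the policy loss. A coefficient that tracks $\lambda^{\star}$ online shrinks as $c$ shrinks, so the bound never falls below the detach value.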