Not Just Where, but When: Temporal Scheduling for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training large language models (LLMs). Policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, yet the heterogeneous policy behaviors exhibited along trajectories are largely overlooked and treated without differentiation. Existing work addresses this through credit allocation, including token-level advantage reweighting and selective token optimization. However, the allocation criteria remain essentially fixed throughout training, which limits flexible policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension that schedules the credit allocation criteria over the course of RLVR optimization. We find that prioritizing targeted tokens associated with specific policy behaviors, and then gradually attenuating this emphasis toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when accommodating heterogeneous behaviors simultaneously, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.
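The idea of emphasizing targeted tokens early and attenuating toward general optimization can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from this work: the function names `scheduled_token_weights` and `token_advantages`, the choice of the first half of the trajectory as the targeted percentile band, and the linear decay schedule are all hypothetical. The abstract does not specify which percentiles are targeted or how the schedule is shaped.

```python
def scheduled_token_weights(length, step, total_steps, band=(0.0, 0.5)):
    """Per-token weights that start focused on a percentile band of the
    trajectory and decay linearly toward uniform weighting.

    Assumptions (hypothetical, for illustration only):
      - a token's "trajectory percentile" is its relative position in [0, 1];
      - `band` is the targeted percentile range, inclusive on both ends;
      - the schedule is linear in training progress.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    # Training progress clipped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)

    weights = []
    for i in range(length):
        # Relative position of token i within the trajectory.
        pos = i / (length - 1) if length > 1 else 0.0
        targeted = 1.0 if band[0] <= pos <= band[1] else 0.0
        # progress = 0: only targeted tokens carry weight.
        # progress = 1: every token has weight 1 (general optimization).
        weights.append((1.0 - progress) * targeted + progress)
    return weights


def token_advantages(advantage, length, step, total_steps, band=(0.0, 0.5)):
    """Broadcast a scalar trajectory-level advantage to tokens, scaled by
    the scheduled weights."""
    weights = scheduled_token_weights(length, step, total_steps, band)
    return [advantage * w for w in weights]
```

Under these assumptions, a five-token trajectory at step 0 receives learning signal only on its first three tokens, while at the final step all five tokens are weighted equally, which corresponds to the standard setting in which the scalar reward is broadcast to every token.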