Not Only Where, but Also When: Temporal Scheduling for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training large language models (LLMs). Policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, so the heterogeneous policy behaviors exhibited along a trajectory are largely left undifferentiated. Existing work addresses this through credit allocation, including token-level advantage reweighting and selective token optimization. However, the allocation criteria remain largely static throughout training, which limits resilient policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension: scheduling the credit allocation criteria over the course of RLVR optimization. We find that first prioritizing targeted tokens associated with specific policy behaviors, then gradually attenuating toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when it accommodates heterogeneous behaviors simultaneously, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.
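
To make the idea of scheduling credit allocation concrete, the following is a minimal illustrative sketch, not the method evaluated in this work. It assumes a linear schedule and a single target band of trajectory percentiles. The abstract specifies neither, and all function names and parameters (`focus_weight`, `token_weights`, `scheduled_advantages`, `target_band`, `decay_fraction`) are hypothetical. Tokens inside the target band always receive full credit, while the weight on all other tokens rises from 0 to 1 as training proceeds, so optimization attenuates toward the general (uniform) case.

```python
from typing import List, Tuple


def focus_weight(step: int, total_steps: int, decay_fraction: float = 0.5) -> float:
    """Weight given to non-target tokens: 0 at step 0, rising linearly to 1.

    decay_fraction is a hypothetical knob: the share of training over which
    the targeted phase fades into uniform (general) optimization.
    """
    decay_steps = max(1, int(total_steps * decay_fraction))
    return min(1.0, max(0.0, step / decay_steps))


def token_weights(
    seq_len: int,
    target_band: Tuple[float, float],
    step: int,
    total_steps: int,
) -> List[float]:
    """Per-token credit weights for one trajectory.

    Tokens whose relative position (percentile along the trajectory) falls in
    target_band always get weight 1; the rest get the scheduled weight.
    The choice of band is an assumption made for illustration only.
    """
    lo, hi = target_band
    background = focus_weight(step, total_steps)
    weights = []
    for t in range(seq_len):
        percentile = (t + 0.5) / seq_len  # midpoint position in (0, 1)
        weights.append(1.0 if lo <= percentile < hi else background)
    return weights


def scheduled_advantages(advantage: float, weights: List[float]) -> List[float]:
    """Broadcast a scalar trajectory advantage to tokens, scaled by the weights."""
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # Early, middle, and late in a hypothetical 1000-step run.
    for step in (0, 250, 500):
        w = token_weights(seq_len=8, target_band=(0.0, 0.25), step=step, total_steps=1000)
        print(step, w)
```

In this sketch the scalar advantage is still broadcast to every token, as in standard RLVR; only the per-token scaling changes over time. Any monotone schedule (cosine, step-wise) or a different band could be substituted without changing the structure.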