Not Only Where, But When: Temporal Scheduling for RLVR
May 25, 2026
Authors: Jinghao Zhang, Ruilin Li, Feng Zhao, Jiaqi Wang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training large language models (LLMs). Policy optimization in RLVR broadcasts a single scalar reward to every sampled token, so the heterogeneous policy behaviors that appear along a trajectory are treated without differentiation and largely overlooked. Existing work addresses this through credit allocation, including token-level advantage reweighting and selective token optimization. However, the allocation criteria remain essentially fixed throughout training, which limits robust policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension: scheduling the credit allocation criteria over the course of RLVR optimization. We find that first prioritizing targeted tokens associated with specific policy behaviors, and then gradually attenuating that emphasis toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural lens for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when accommodating heterogeneous behaviors simultaneously, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks show consistent improvements, suggesting that temporal scheduling is a promising optimization dimension.
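To make the idea of scheduling a credit allocation criterion concrete, the following is a minimal illustrative sketch and not the authors' implementation. It rests on several assumptions that the abstract does not state: that a "trajectory percentile" means a token's relative position within its sampled response, that the targeted tokens are those in a fixed percentile band, and that the emphasis decays with a cosine schedule. The names `focus_strength`, `token_weights`, `weighted_advantages`, and the `band` parameter are all hypothetical.

```python
import math


def focus_strength(step, total_steps):
    """Hypothetical cosine schedule: 1.0 at the start of training, 0.0 at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def token_weights(seq_len, step, total_steps, band=(0.0, 0.3)):
    """Per-token weights for one sampled trajectory.

    Assumption: a token's percentile is its relative position in the response.
    Tokens inside `band` always keep weight 1.0. Tokens outside it get weight
    1 - focus_strength, so they are ignored early in training and fully
    included (plain uniform optimization) once the schedule reaches 0.
    """
    strength = focus_strength(step, total_steps)
    low, high = band
    weights = []
    for t in range(seq_len):
        percentile = t / max(seq_len - 1, 1)
        targeted = low <= percentile <= high
        weights.append(1.0 if targeted else 1.0 - strength)
    return weights


def weighted_advantages(advantage, seq_len, step, total_steps, band=(0.0, 0.3)):
    """Broadcast a scalar trajectory advantage to every token, then reweight it."""
    weights = token_weights(seq_len, step, total_steps, band)
    return [advantage * w for w in weights]
```

Under these assumptions, `weighted_advantages(2.0, 5, 0, 100)` returns `[2.0, 2.0, 0.0, 0.0, 0.0]` at the first step, and at the last step every token receives the full advantage of `2.0`, which recovers the standard globally broadcast signal. Which behaviors to target and how quickly to relax the emphasis are exactly the design choices the paper studies, so the band and schedule here are placeholders.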