Not Only Where, but Also When: Temporal Scheduling for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training Large Language Models (LLMs). Policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, yet the heterogeneous policy behaviors exhibited along a trajectory are largely treated without differentiation. Existing work addresses this through credit allocation, including token-level advantage reweighting and selective token optimization. However, the allocation criteria remain largely static throughout training, which limits how flexibly the policy can evolve. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension that schedules the credit allocation criteria over the course of RLVR optimization. We find that first prioritizing targeted tokens that emphasize specific policy behaviors, and then gradually attenuating that emphasis toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when it accommodates heterogeneous behaviors simultaneously, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.
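The idea of prioritizing targeted tokens early and attenuating toward general optimization can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: "trajectory percentile" is read as a token's relative position within its trajectory, the targeted band (`lo`, `hi`), the linear decay in `schedule`, and all function names are hypothetical.

```python
from typing import List


def token_percentiles(num_tokens: int) -> List[float]:
    """Relative position of each token in its trajectory, in (0, 1] (assumed definition)."""
    return [(i + 1) / num_tokens for i in range(num_tokens)]


def schedule(step: int, total_steps: int, anneal_fraction: float = 0.5) -> float:
    """Emphasis strength: 1.0 at step 0, decaying linearly to 0.0 (hypothetical schedule)."""
    horizon = max(1, int(anneal_fraction * total_steps))
    return max(0.0, 1.0 - step / horizon)


def token_weights(num_tokens: int, step: int, total_steps: int,
                  lo: float = 0.0, hi: float = 0.2) -> List[float]:
    """Per-token weights: targeted tokens keep weight 1.0 throughout training;
    the remaining tokens ramp from 0.0 up to 1.0 as the emphasis decays."""
    strength = schedule(step, total_steps)
    weights = []
    for p in token_percentiles(num_tokens):
        targeted = lo < p <= hi  # hypothetical percentile band of interest
        weights.append(1.0 if targeted else 1.0 - strength)
    return weights


def scheduled_advantages(advantage: float, num_tokens: int,
                         step: int, total_steps: int) -> List[float]:
    """Broadcast one scalar advantage over the tokens, scaled by the scheduled weights."""
    return [advantage * w for w in token_weights(num_tokens, step, total_steps)]


if __name__ == "__main__":
    for step in (0, 25, 50):
        print(step, token_weights(10, step, 100))
```

Under these assumptions, only the targeted band receives a learning signal at the start of training, and every token is weighted equally once the schedule has fully decayed, which recovers standard optimization with a broadcast reward.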