Beyond Uniform Token-Level Trust Regions in LLM Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic: they enforce the same threshold on every token independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, so a static threshold under-regulates early divergence while over-constraining late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address these limitations, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions, whose effects persist longer, and relaxes constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations and dynamically restricts further token-level deviation, preventing errors from compounding along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
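
To make the two coupled mechanisms concrete, the following is a minimal illustrative sketch of a position-aware, prefix-budgeted token mask. It is not the exact CPPO rule, which the abstract does not specify. Every concrete choice here is an assumption made for illustration: the function name `cppo_style_mask`, the use of the absolute log-ratio as the per-token deviation, the linear threshold schedule from `eps_early` to `eps_late`, the additive `prefix_budget`, and all default values.

```python
import numpy as np

def cppo_style_mask(logp_new, logp_old, eps_early=0.1, eps_late=0.3, prefix_budget=1.0):
    """Return a 0/1 mask over the tokens of one sampled sequence.

    Hypothetical sketch: the deviation measure, the schedule, and the
    budget rule are assumptions, not the paper's definitions.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    num_tokens = logp_new.shape[0]

    # Assumed per-token deviation: absolute log-ratio between the
    # current policy and the rollout policy on the sampled token.
    dev = np.abs(logp_new - logp_old)

    # Position-weighted threshold: strict at early positions, looser later
    # (assumed linear schedule).
    eps = np.linspace(eps_early, eps_late, num_tokens)

    # Drift accumulated strictly before each position (exclusive prefix sum).
    prefix_drift = np.cumsum(dev) - dev

    # Cumulative prefix budget: whatever allowance the prefix has not
    # already consumed caps the deviation permitted at this token.
    remaining = np.maximum(prefix_budget - prefix_drift, 0.0)
    allowed = np.minimum(eps, remaining)

    # Tokens within the allowance keep their gradient; the rest are masked.
    return (dev <= allowed).astype(float)

if __name__ == "__main__":
    logp_old = np.zeros(4)
    logp_new = np.array([0.2, 0.05, 0.05, 0.25])
    print(cppo_style_mask(logp_new, logp_old, prefix_budget=0.5))
```

In this toy input, the first token is masked because its deviation exceeds the strict early threshold. The last token is masked even though its deviation is below the relaxed late threshold, because the prefix has already consumed most of the budget. With a uniform threshold and no budget, neither distinction would be made.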