V_{0.5}: Generalist Value Model as a Prior for Sparse RL Rollouts

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradient methods, because the baseline guides the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as V_0), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to update the value model synchronously with the policy model. In this paper, we propose V_{0.5}, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. The result is a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which trade off the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test that evaluates the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the Mean Squared Error (MSE) of the baseline estimator and guarantees stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that V_{0.5} significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of over 10%.
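The abstract gives no formulas, so the following is only a minimal sketch of how a fuse-test-allocate loop of this kind could look, not the authors' implementation. It assumes binary (0/1) verifiable rewards, a two-sided z-test of the prior against the empirical mean, and a precision-weighted average as the fusion rule. All function names, `prior_var`, `z_crit`, and `max_rollouts` are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Tuple


def _clip(p: float, eps: float = 1e-3) -> float:
    """Keep a probability away from 0 and 1 so the variance stays positive."""
    return min(max(p, eps), 1.0 - eps)


def prior_is_plausible(prior: float, rewards: List[float], z_crit: float = 1.96) -> bool:
    """Two-sided z-test of H0: the true success rate equals the prior (assumed test)."""
    n = len(rewards)
    mean = sum(rewards) / n
    p = _clip(prior)
    z = abs(mean - p) / math.sqrt(p * (1.0 - p) / n)
    return z <= z_crit


def fuse_baseline(prior: float, rewards: List[float], prior_var: float) -> float:
    """Precision-weighted mix of the prior and the empirical mean (assumed fusion rule)."""
    n = len(rewards)
    mean = sum(rewards) / n
    p = _clip(prior)
    sample_var = p * (1.0 - p) / n  # plug-in Bernoulli variance of the sample mean
    w = sample_var / (sample_var + prior_var)  # weight placed on the prior
    return w * prior + (1.0 - w) * mean


def adaptive_baseline(
    prior: float,
    sample_reward: Callable[[], float],
    group_size: int = 4,
    max_rollouts: int = 16,
    prior_var: float = 0.01,
    z_crit: float = 1.96,
) -> Tuple[float, List[float]]:
    """Start with a sparse group; buy more rollouts only while the prior is rejected."""
    rewards = [sample_reward() for _ in range(group_size)]
    while not prior_is_plausible(prior, rewards, z_crit) and len(rewards) < max_rollouts:
        rewards.extend(sample_reward() for _ in range(group_size))
    if prior_is_plausible(prior, rewards, z_crit):
        return fuse_baseline(prior, rewards, prior_var), rewards
    # Prior still rejected at the budget cap: fall back to the empirical mean.
    return sum(rewards) / len(rewards), rewards
```

In this sketch, a prompt whose first four rollouts agree with the prior costs only those four rollouts, while a prompt whose rollouts contradict the prior receives extra samples up to the cap and eventually ignores the prior. The weight `w` shrinks as more rollouts arrive, so the empirical mean gradually dominates the fused baseline.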
