V_{0.5}: Generalist Value Model as a Prior for Sparse RL Rollouts

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, because the baseline guides the policy model toward reinforcing desired behaviors. Recent research has introduced Generalist Value Models (such as V_0), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to update the value model synchronously alongside the policy model. In this paper, we propose V_{0.5}, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. The result is a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test that evaluates the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the Mean Squared Error (MSE) of the baseline estimator and guarantees stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that V_{0.5} significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of roughly 10%.
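The abstract does not give the fusion rule or the test statistic, so the following is only a minimal sketch of the general idea under assumptions of our own. It assumes binary verifiable rewards, a two-sided z-test against the prior, and a pseudo-count shrinkage rule. The names `prior_is_plausible`, `fused_baseline`, `estimate_baseline`, `prior_strength`, and `max_budget` are hypothetical and are not taken from the paper.

```python
import math
import random


def prior_is_plausible(prior, rewards, z_crit=1.96):
    """Assumed test: two-sided z-test of binary rewards against the prior.

    The normal approximation is crude for groups as small as 4; it is used
    here only to keep the illustration short.
    """
    n = len(rewards)
    p0 = min(max(prior, 1e-6), 1 - 1e-6)  # keep the null variance positive
    std_err = math.sqrt(p0 * (1 - p0) / n)
    z = (sum(rewards) / n - p0) / std_err
    return abs(z) <= z_crit


def fused_baseline(prior, rewards, prior_strength):
    """Assumed fusion rule: the prior counts as `prior_strength` pseudo-rollouts."""
    return (prior_strength * prior + sum(rewards)) / (prior_strength + len(rewards))


def estimate_baseline(prior, sample_reward, group_size=4, max_budget=16,
                      prior_strength=4.0):
    """Return (baseline, rollouts_used) for a single prompt."""
    rewards = [sample_reward() for _ in range(group_size)]
    # Spend extra rollout budget only while the test rejects the prior.
    while not prior_is_plausible(prior, rewards) and len(rewards) < max_budget:
        rewards.extend(sample_reward() for _ in range(group_size))
    if prior_is_plausible(prior, rewards):
        # Prior looks reliable: shrink the noisy empirical mean toward it.
        return fused_baseline(prior, rewards, prior_strength), len(rewards)
    # Prior still rejected at full budget: fall back to the empirical mean.
    return sum(rewards) / len(rewards), len(rewards)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical prompt: true success rate 0.3, overconfident prior of 0.8.
    baseline, used = estimate_baseline(0.8, lambda: int(random.random() < 0.3))
    print(f"baseline={baseline:.3f} rollouts_used={used}")
```

In this sketch, a prior that agrees with the first four rollouts costs no extra sampling and lowers variance through shrinkage, while a prior that the test rejects triggers additional rollouts and is eventually discarded, which removes its bias from the baseline.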
