Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
May 7, 2025
Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
cs.AI
Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization on both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
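
The abstract describes verification-based test-time scaling only at a high level. Below is a minimal Python sketch of one plausible way a unified reasoner-verifier could be used with parallel sampling: verifier-weighted voting over N sampled solutions. The callables `generate` and `verify`, the "Yes"-probability scoring, and the weighted-voting aggregation are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def best_of_n_with_verifier(question, generate, verify, n=8):
    """Verifier-weighted Best-of-N selection (a minimal sketch).

    `generate(question)` is a hypothetical callable that samples one
    chain-of-thought solution and returns (solution_text, final_answer).
    `verify(question, solution_text)` is a hypothetical callable that asks the
    same unified model whether the solution is correct and returns a score in
    [0, 1], e.g. the probability it assigns to a "Yes" token.
    """
    # In practice the n samples would be drawn in parallel; a loop keeps the sketch simple.
    candidates = [generate(question) for _ in range(n)]

    # Accumulate each distinct final answer's total verifier score.
    votes = defaultdict(float)
    for solution, answer in candidates:
        votes[answer] += verify(question, solution)

    # Return the answer with the largest verifier-weighted vote.
    return max(votes, key=votes.get)
```

In this sketch, increasing `n` trades additional parallel test-time compute for accuracy, which is the axis along which the abstract reports the gains over the base RL method.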