Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
May 7, 2025
Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
cs.AI
Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization on both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
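
The abstract describes verification-based test-time scaling only at a high level. Below is a minimal Python sketch of one plausible way a unified reasoner-verifier could be used with parallel sampling: verifier-weighted voting over N sampled solutions. The callables `generate` and `verify`, the "Yes"-probability scoring, and the weighted-voting aggregation are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def best_of_n_with_verifier(question, generate, verify, n=8):
    """Verifier-weighted Best-of-N selection (a minimal sketch).

    `generate(question)` is a hypothetical callable that samples one
    chain-of-thought solution and returns (solution_text, final_answer).
    `verify(question, solution_text)` is a hypothetical callable that asks the
    same unified model whether the solution is correct and returns a score in
    [0, 1], e.g. the probability it assigns to a "Yes" token.
    """
    # In practice the n samples would be drawn in parallel; a loop keeps the sketch simple.
    candidates = [generate(question) for _ in range(n)]

    # Accumulate each distinct final answer's total verifier score.
    votes = defaultdict(float)
    for solution, answer in candidates:
        votes[answer] += verify(question, solution)

    # Return the answer with the largest verifier-weighted vote.
    return max(votes, key=votes.get)
```

In this sketch, increasing `n` trades additional parallel test-time compute for accuracy, which is the axis along which the abstract reports the gains over the base RL method.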