Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
May 7, 2025
Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
cs.AI
Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities on both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
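Since the abstract describes reusing the jointly trained model as a generative verifier to rank parallel samples at test time, the following minimal Python sketch illustrates one way verifier-weighted best-of-N selection could look. The prompt template, the `generate`/`yes_probability` interfaces, and the weighted-voting rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a single model trained as both a
# reasoner and a generative verifier is used for verifier-weighted best-of-N
# selection at inference time. All interfaces below are assumed.

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# generate(prompt, n) -> n sampled solution strings (assumed interface)
# yes_probability(prompt) -> P("Yes") under the same model for a verification
#   prompt, i.e. the generative-verifier score (assumed interface)
GenerateFn = Callable[[str, int], List[str]]
ScoreFn = Callable[[str], float]


def verification_prompt(question: str, solution: str) -> str:
    """Hypothetical verification prompt asking the unified model to judge
    its own sampled solution; the real template is not given in the abstract."""
    return (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Is this solution correct? Answer Yes or No."
    )


def weighted_best_of_n(
    question: str,
    generate: GenerateFn,
    yes_probability: ScoreFn,
    extract_answer: Callable[[str], str],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n solutions in parallel, score each with the same model acting
    as a generative verifier, and return the final answer with the highest
    total verifier score (verifier-weighted voting over answers)."""
    solutions = generate(question, n)
    answer_scores: Dict[str, float] = defaultdict(float)
    for sol in solutions:
        score = yes_probability(verification_prompt(question, sol))
        answer_scores[extract_answer(sol)] += score
    best_answer = max(answer_scores, key=answer_scores.get)
    return best_answer, answer_scores[best_answer]
```

Under this sketch, increasing `n` is the parallel axis of test-time compute, while longer chains of thought from the sampler would be the sequential axis; the verifier score is what lets the extra samples improve accuracy rather than just majority-vote.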