Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
May 7, 2025
Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
cs.AI
Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities on both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
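Since the abstract describes reusing the jointly trained model as a generative verifier to rank parallel samples at test time, the following minimal Python sketch illustrates one way verifier-weighted best-of-N selection could look. The prompt template, the `generate`/`yes_probability` interfaces, and the weighted-voting rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a single model trained as both a
# reasoner and a generative verifier is used for verifier-weighted best-of-N
# selection at inference time. All interfaces below are assumed.

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# generate(prompt, n) -> n sampled solution strings (assumed interface)
# yes_probability(prompt) -> P("Yes") under the same model for a verification
#   prompt, i.e. the generative-verifier score (assumed interface)
GenerateFn = Callable[[str, int], List[str]]
ScoreFn = Callable[[str], float]


def verification_prompt(question: str, solution: str) -> str:
    """Hypothetical verification prompt asking the unified model to judge
    its own sampled solution; the real template is not given in the abstract."""
    return (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Is this solution correct? Answer Yes or No."
    )


def weighted_best_of_n(
    question: str,
    generate: GenerateFn,
    yes_probability: ScoreFn,
    extract_answer: Callable[[str], str],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n solutions in parallel, score each with the same model acting
    as a generative verifier, and return the final answer with the highest
    total verifier score (verifier-weighted voting over answers)."""
    solutions = generate(question, n)
    answer_scores: Dict[str, float] = defaultdict(float)
    for sol in solutions:
        score = yes_probability(verification_prompt(question, sol))
        answer_scores[extract_answer(sol)] += score
    best_answer = max(answer_scores, key=answer_scores.get)
    return best_answer, answer_scores[best_answer]
```

Under this sketch, increasing `n` is the parallel axis of test-time compute, while longer chains of thought from the sampler would be the sequential axis; the verifier score is what lets the extra samples improve accuracy rather than just majority-vote.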