RLPR: Extrapolating RLVR to General Domains without Verifiers
June 23, 2025
Authors: Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising
potential in advancing the reasoning capabilities of LLMs. However, its success
remains largely confined to mathematical and code domains. This primary
limitation stems from the heavy reliance on domain-specific verifiers, which
results in prohibitive complexity and limited scalability. To address this
challenge, our key observation is that an LLM's intrinsic probability of
generating a correct free-form answer directly indicates its own evaluation of
the reasoning reward (i.e., how well the reasoning process leads to the correct
answer). Building on this insight, we propose RLPR, a simple verifier-free
framework that extrapolates RLVR to broader general domains. RLPR uses the
LLM's own token probability scores for reference answers as the reward signal
and maximizes the expected reward during training. We find that addressing the
high variance of this noisy probability reward is crucial to making it work, and
propose prob-to-reward conversion and stabilization methods to ensure precise
and stable rewards from LLM intrinsic probabilities. Comprehensive experiments on four
general-domain benchmarks and three mathematical benchmarks show that RLPR
consistently improves reasoning capabilities in both areas for Gemma-, Llama-,
and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree method
by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the
strong verifier-model-dependent approach General-Reasoner by 1.6 points on
average across seven benchmarks.
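To make the reward construction more concrete, below is a minimal sketch, not the paper's exact implementation, of a probability-based reward: it scores a reference answer by the mean token probability the policy assigns to it (conditioned on the question and the model's generated reasoning), and then debiases that score by subtracting the probability obtained without the reasoning. The function names `prob_reward` and `debiased_reward`, the averaging choice, and the clipping are illustrative assumptions standing in for the paper's prob-to-reward conversion and stabilization steps.

```python
import torch

def prob_reward(ref_answer_logprobs: torch.Tensor) -> torch.Tensor:
    """Scalar reward in [0, 1] for one sampled response.

    ref_answer_logprobs: per-token log-probabilities (shape [T]) that the
    policy assigns to the T tokens of the reference answer, conditioned on
    the question and the model's own generated reasoning.
    """
    # Averaging token probabilities (rather than multiplying them) keeps the
    # reward from collapsing toward 0 whenever a single token is unlikely,
    # which reduces the variance of this noisy signal (an assumed choice).
    return ref_answer_logprobs.exp().mean()

def debiased_reward(r_with_reasoning: torch.Tensor,
                    r_without_reasoning: torch.Tensor) -> torch.Tensor:
    """Assumed debiasing step: keep only the part of the reward that the
    generated reasoning adds over answering the question directly."""
    return (r_with_reasoning - r_without_reasoning).clamp(0.0, 1.0)

# Example: a 4-token reference answer scored with and without reasoning.
logp_with = torch.log(torch.tensor([0.9, 0.8, 0.7, 0.9]))
logp_without = torch.log(torch.tensor([0.4, 0.3, 0.5, 0.6]))
reward = debiased_reward(prob_reward(logp_with), prob_reward(logp_without))
print(f"reward = {reward.item():.3f}")
```

In a training loop, these scalar rewards would simply replace a verifier's 0/1 score inside a standard policy-optimization objective that maximizes expected reward; that part is omitted from the sketch.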