RLPR: Extrapolating RLVR to General Domains without Verifiers
June 23, 2025
Authors: Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising
potential in advancing the reasoning capabilities of LLMs. However, its success
remains largely confined to mathematical and code domains. This primary
limitation stems from the heavy reliance on domain-specific verifiers, which
results in prohibitive complexity and limited scalability. To address this
challenge, our key observation is that an LLM's intrinsic probability of
generating a correct free-form answer directly indicates its own evaluation of
the reasoning reward (i.e., how well the reasoning process leads to the correct
answer). Building on this insight, we propose RLPR, a simple verifier-free
framework that extrapolates RLVR to broader general domains. RLPR uses the
LLM's own token probability scores for reference answers as the reward signal
and maximizes the expected reward during training. We find that addressing the
high variance of this noisy probability reward is crucial to making it work, and
propose prob-to-reward conversion and stabilization methods to ensure precise
and stable rewards from LLM intrinsic probabilities. Comprehensive experiments on four
general-domain benchmarks and three mathematical benchmarks show that RLPR
consistently improves reasoning capabilities in both areas for Gemma-, Llama-,
and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree method
by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the
strong verifier-model-dependent approach General-Reasoner by 1.6 points on
average across seven benchmarks.
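To make the reward construction more concrete, below is a minimal sketch, not the paper's exact implementation, of a probability-based reward: it scores a reference answer by the mean token probability the policy assigns to it (conditioned on the question and the model's generated reasoning), and then debiases that score by subtracting the probability obtained without the reasoning. The function names `prob_reward` and `debiased_reward`, the averaging choice, and the clipping are illustrative assumptions standing in for the paper's prob-to-reward conversion and stabilization steps.

```python
import torch

def prob_reward(ref_answer_logprobs: torch.Tensor) -> torch.Tensor:
    """Scalar reward in [0, 1] for one sampled response.

    ref_answer_logprobs: per-token log-probabilities (shape [T]) that the
    policy assigns to the T tokens of the reference answer, conditioned on
    the question and the model's own generated reasoning.
    """
    # Averaging token probabilities (rather than multiplying them) keeps the
    # reward from collapsing toward 0 whenever a single token is unlikely,
    # which reduces the variance of this noisy signal (an assumed choice).
    return ref_answer_logprobs.exp().mean()

def debiased_reward(r_with_reasoning: torch.Tensor,
                    r_without_reasoning: torch.Tensor) -> torch.Tensor:
    """Assumed debiasing step: keep only the part of the reward that the
    generated reasoning adds over answering the question directly."""
    return (r_with_reasoning - r_without_reasoning).clamp(0.0, 1.0)

# Example: a 4-token reference answer scored with and without reasoning.
logp_with = torch.log(torch.tensor([0.9, 0.8, 0.7, 0.9]))
logp_without = torch.log(torch.tensor([0.4, 0.3, 0.5, 0.6]))
reward = debiased_reward(prob_reward(logp_with), prob_reward(logp_without))
print(f"reward = {reward.item():.3f}")
```

In a training loop, these scalar rewards would simply replace a verifier's 0/1 score inside a standard policy-optimization objective that maximizes expected reward; that part is omitted from the sketch.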