RLPR: Extrapolating RLVR to General Domains without Verifiers
June 23, 2025
Authors: Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) shows promising potential for advancing the reasoning capabilities of LLMs. However, its success remains largely confined to the mathematical and code domains. This limitation stems primarily from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial for making it work, and we propose prob-to-reward and stabilizing methods to obtain a precise and stable reward from the LLM's intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses General-Reasoner, a strong approach that depends on a verifier model, by 1.6 points on average across seven benchmarks.
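To make the probability-based reward concrete, here is a minimal sketch (not the authors' implementation) of how such a signal could be computed with Hugging Face Transformers: the reward for a sampled reasoning trace is the model's mean per-token probability of the reference answer, conditioned on the question and that trace. The model name, the prompt layout, and the use of a plain mean in place of the paper's full prob-to-reward and stabilization steps are all illustrative assumptions.

```python
# Illustrative sketch of a verifier-free probability reward in the spirit of
# RLPR. Assumptions: model choice, prompt format, and scoring by the mean
# per-token probability of the reference answer (a simplification of the
# paper's prob-to-reward and stabilization treatment).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


@torch.no_grad()
def probability_reward(question: str, reasoning: str, reference_answer: str) -> float:
    """Mean probability the model assigns to each reference-answer token
    when the answer follows the question and the sampled reasoning."""
    prefix = f"Question: {question}\nReasoning: {reasoning}\nAnswer: "
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    answer_ids = tokenizer(reference_answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)

    logits = model(input_ids).logits            # (1, seq_len, vocab)
    probs = torch.softmax(logits[:, :-1, :], dim=-1)  # predicts the next token
    answer_start = prefix_ids.shape[1]
    target = input_ids[:, answer_start:]        # reference-answer tokens
    pred = probs[:, answer_start - 1:, :]       # distributions over those tokens
    token_probs = pred.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # A mean over answer tokens (rather than the product, i.e. sequence
    # likelihood) keeps the reward scale comparable across answer lengths.
    return token_probs.mean().item()


reward = probability_reward(
    question="What is 7 * 8?",
    reasoning="7 * 8 equals 56.",
    reference_answer="56",
)
print(f"probability reward: {reward:.3f}")
```

In an RL loop, this scalar would stand in for the rule-based verifier score used by RLVR: rollouts whose reasoning makes the reference answer more probable receive higher reward.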