RLPR: Extrapolating RLVR to General Domains without Verifiers
June 23, 2025
Authors: Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) shows promising potential for advancing the reasoning capabilities of LLMs. However, its success remains largely confined to the mathematical and code domains. This limitation stems primarily from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial for making it work, and we propose prob-to-reward and stabilizing methods to obtain a precise and stable reward from the LLM's intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses General-Reasoner, a strong approach that depends on a verifier model, by 1.6 points on average across seven benchmarks.
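To make the probability-based reward concrete, here is a minimal sketch (not the authors' implementation) of how such a signal could be computed with Hugging Face Transformers: the reward for a sampled reasoning trace is the model's mean per-token probability of the reference answer, conditioned on the question and that trace. The model name, the prompt layout, and the use of a plain mean in place of the paper's full prob-to-reward and stabilization steps are all illustrative assumptions.

```python
# Illustrative sketch of a verifier-free probability reward in the spirit of
# RLPR. Assumptions: model choice, prompt format, and scoring by the mean
# per-token probability of the reference answer (a simplification of the
# paper's prob-to-reward and stabilization treatment).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


@torch.no_grad()
def probability_reward(question: str, reasoning: str, reference_answer: str) -> float:
    """Mean probability the model assigns to each reference-answer token
    when the answer follows the question and the sampled reasoning."""
    prefix = f"Question: {question}\nReasoning: {reasoning}\nAnswer: "
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    answer_ids = tokenizer(reference_answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)

    logits = model(input_ids).logits            # (1, seq_len, vocab)
    probs = torch.softmax(logits[:, :-1, :], dim=-1)  # predicts the next token
    answer_start = prefix_ids.shape[1]
    target = input_ids[:, answer_start:]        # reference-answer tokens
    pred = probs[:, answer_start - 1:, :]       # distributions over those tokens
    token_probs = pred.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # A mean over answer tokens (rather than the product, i.e. sequence
    # likelihood) keeps the reward scale comparable across answer lengths.
    return token_probs.mean().item()


reward = probability_reward(
    question="What is 7 * 8?",
    reasoning="7 * 8 equals 56.",
    reference_answer="56",
)
print(f"probability reward: {reward:.3f}")
```

In an RL loop, this scalar would stand in for the rule-based verifier score used by RLVR: rollouts whose reasoning makes the reference answer more probable receive higher reward.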