RLPR: Extrapolating RLVR to General Domains without Verifiers

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has shown promising potential for advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This limitation stems primarily from heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to making it work, and we propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from the LLM's intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the strong verifier-model-dependent approach General-Reasoner by 1.6 points on average across seven benchmarks.
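To make the idea of a probability-based reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the policy model has already produced log-probabilities for each token of the reference answer, conditioned on the prompt and a sampled reasoning trace, and it collapses them into one scalar with an arithmetic mean. The function name `probability_reward`, the mean aggregation, and the example numbers are all illustrative assumptions; the abstract does not specify how token probabilities are aggregated or how the reward is stabilized.

```python
import math
from typing import Sequence


def probability_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Collapse per-token log-probabilities of a reference answer into a reward in [0, 1].

    Assumption: the arithmetic mean of token probabilities is used here
    purely for illustration; the abstract does not state the aggregation.
    """
    if not answer_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    # Convert log-probabilities back to probabilities before averaging.
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)


# Hypothetical log-probs of the same 3-token reference answer after two
# different reasoning traces (made-up numbers, not measured results).
good_reasoning = [math.log(0.9), math.log(0.8), math.log(0.7)]
poor_reasoning = [math.log(0.2), math.log(0.1), math.log(0.3)]

reward_good = probability_reward(good_reasoning)
reward_poor = probability_reward(poor_reasoning)
```

Under these assumptions, a reasoning trace that makes the reference answer more likely receives a higher reward, with no external verifier involved. A real training setup would obtain the log-probabilities from a forward pass of the model and would still need the variance-reduction steps that the abstract describes as crucial.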
