RLPR: Extrapolating RLVR to General Domains without Verifiers

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) shows promising potential for advancing the reasoning capabilities of large language models (LLMs). However, its success remains largely confined to the mathematical and code domains. This limitation stems from a heavy reliance on domain-specific verifiers, which leads to prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to making it work, and we propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from the LLM's intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses General-Reasoner, a strong approach that depends on a verifier model, by 1.6 points on average across the seven benchmarks.
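To make the core idea concrete, the following is a minimal sketch of a probability-based reward, assuming the per-token log-probabilities that the policy model assigns to the reference answer are already available. The abstract does not spell out the prob-to-reward or stabilizing methods, so everything specific here is an illustrative assumption rather than the authors' implementation: the function names, the mean-probability aggregation, the subtraction of a no-reasoning baseline, and the standard-deviation filter with its threshold.

```python
import math
from statistics import pstdev
from typing import Sequence


def probability_reward(ref_token_logprobs: Sequence[float]) -> float:
    """Mean per-token probability of the reference answer.

    Assumption: averaging token probabilities is one plausible way to
    turn token scores into a single reward; it is not taken from the abstract.
    """
    if not ref_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    probs = [math.exp(lp) for lp in ref_token_logprobs]
    return sum(probs) / len(probs)


def debiased_reward(
    with_reasoning: Sequence[float],
    without_reasoning: Sequence[float],
) -> float:
    """Hypothetical prob-to-reward step.

    Credits only the probability gained by conditioning on the sampled
    reasoning, then clips the result to the range [0, 1].
    """
    gain = probability_reward(with_reasoning) - probability_reward(without_reasoning)
    return min(1.0, max(0.0, gain))


def keep_prompt(rewards: Sequence[float], min_std: float = 0.05) -> bool:
    """Hypothetical stabilizing step.

    Keeps a prompt only if the rewards of its sampled responses vary
    enough to carry a learning signal. The threshold is arbitrary.
    """
    return pstdev(rewards) >= min_std
```

In this sketch, a response whose reasoning raises the mean probability of the reference answer from 0.3 to 0.8 would receive a reward of about 0.5, while a prompt whose sampled responses all score the same would be skipped for that training step.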
