Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
June 17, 2025
Authors: Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
promising paradigm for advancing the reasoning capabilities of Large Language
Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned
models often underperform their base models on the Pass@K metric for
solution-finding, leading to the hypothesis that RLVR merely re-weights
existing reasoning paths at the cost of reasoning diversity. In this work, we
resolve this contradiction by identifying the source of the problem: the
Pass@K metric itself is a flawed measure of reasoning, as it credits correct
final answers that probably arise from inaccurate or incomplete chains of
thought (CoTs). To address this, we introduce a more precise evaluation metric,
CoT-Pass@K, which mandates that both the reasoning path and the final
answer be correct. We provide a new theoretical foundation that formalizes how
RLVR, unlike traditional RL, is uniquely structured to incentivize logical
integrity. Our empirical results are supportive: using CoT-Pass@K, we
observe that RLVR can incentivize the generalization of correct reasoning for
all values of K. Furthermore, by analyzing the training dynamics, we find
that this enhanced reasoning capability emerges early in the training process
and smoothly generalizes. Our work provides a clear perspective on the role of
RLVR, offers a more reliable method for its evaluation, and confirms its
potential to genuinely advance machine reasoning.
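To make the distinction between the two metrics concrete, the sketch below contrasts the standard unbiased Pass@K estimator with a CoT-Pass@K variant that only credits samples whose chain of thought is also verified as correct. This is an illustrative sketch, not the paper's implementation: the function names, the assumption that CoT correctness is available as a simple count, and the example numbers are all hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: probability that at least one
    of k samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    """CoT-Pass@K (illustrative): same estimator, but a generation only
    counts as correct when both its chain of thought and its final answer
    are verified, so c_cot <= c."""
    return pass_at_k(n, c_cot, k)

# Hypothetical example: 16 samples per problem, 9 reach the right final
# answer, but only 6 of those also have a sound chain of thought.
print(pass_at_k(16, 9, 4))      # Pass@4 credits all 9
print(cot_pass_at_k(16, 6, 4))  # CoT-Pass@4 credits only the 6 with valid CoTs
```

Under this stricter counting, a model that reaches correct answers through flawed reasoning scores lower, which is the gap the abstract argues Pass@K alone cannot expose.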