
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

June 17, 2025
作者: Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
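
To make the distinction between the two metrics concrete, below is a minimal sketch (not taken from the paper) of how both could be estimated from n samples per problem. It reuses the standard unbiased Pass@K estimator and simply tightens the definition of a "correct" sample for CoT-Pass@K; the sample counts and the assumption that chain-of-thought correctness has already been judged externally are hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k samples
    drawn without replacement from n total samples is correct, given that
    c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    """CoT-Pass@K: the same estimator, except a sample only counts as correct
    if BOTH its chain of thought and its final answer are judged correct,
    so c_cot <= c by construction."""
    return pass_at_k(n, c_cot, k)


# Hypothetical counts for one problem: 16 samples, 10 with a correct final
# answer, but only 6 of those also have a fully correct chain of thought.
n, c_answer_only, c_answer_and_cot, k = 16, 10, 6, 4
print(f"Pass@{k}     = {pass_at_k(n, c_answer_only, k):.3f}")
print(f"CoT-Pass@{k} = {cot_pass_at_k(n, c_answer_and_cot, k):.3f}")
```

In this toy example CoT-Pass@K is strictly lower than Pass@K because it discards answers that are right for the wrong reasons, which is exactly the gap the paper argues inflates Pass@K comparisons between base and RLVR-tuned models.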