Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
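To illustrate the distinction between Pass@K and CoT-Pass@K described above, the following is a minimal sketch, not the authors' implementation. It assumes that each sampled response for a problem has already been labeled with two booleans (final answer correct, chain of thought correct) and that both metrics are computed with the commonly used unbiased Pass@K estimator, 1 - C(n-c, K)/C(n, K). The function names and the labeling format are hypothetical, and how CoT correctness is judged is not specified in the abstract.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, c of which count as successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that k samples drawn
    # without replacement contain no success at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Pass@K: a sample succeeds if its final answer is correct."""
    c = sum(1 for answer_ok, _ in samples if answer_ok)
    return pass_at_k(len(samples), c, k)


def cot_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """CoT-Pass@K (assumed form): a sample succeeds only if both the
    final answer and the chain of thought are correct."""
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(len(samples), c, k)


if __name__ == "__main__":
    # Hypothetical labels: (answer_correct, cot_correct) for 8 samples.
    samples = [(True, True), (True, False), (True, False), (True, False),
               (False, False), (False, False), (False, False), (False, False)]
    for k in (1, 4, 8):
        print(k, answer_pass_at_k(samples, k), cot_pass_at_k(samples, k))
```

In this hypothetical example, four of eight samples reach the correct answer but only one does so with a correct chain of thought, so Pass@1 is 0.5 while CoT-Pass@1 is 0.125. The gap narrows as K grows, and both metrics reach 1.0 at K = 8.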
