RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, because RLVR is essentially on-policy and LLMs have an immense action space with sparse rewards, it struggles to break through the inherent capability boundaries of the base LLM. Critically, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, which addresses the distributional mismatch from external data, and an Exploration-Based Advantage Function, which guides the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; and 3) consistent and significant gains across diverse model families, with average relative improvements of up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
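
The abstract relies on Pass@k curves to diagnose capability boundary collapse but does not say how Pass@k is computed. The sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct; that choice is an assumption, not something the abstract confirms. The function names and the toy counts are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: attempt budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy illustration (hypothetical counts, 8 samples per problem):
# the "base" model solves every problem occasionally, while the
# "tuned" model solves half of the problems every time and the rest never.
base_counts = [1, 1, 1, 1]
tuned_counts = [8, 8, 0, 0]

for k in (1, 8):
    print(k, mean_pass_at_k(base_counts, 8, k), mean_pass_at_k(tuned_counts, 8, k))
```

In this toy case the tuned model wins at k = 1 but loses at k = 8, which is the shape of curve that a narrowed problem-solving scope would produce.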
