Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Abstract

Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
