GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Abstract

Recent advances in Large Language Models (LLMs) have shown that using Process Reward Models (PRMs) as verifiers is a promising way to improve LLM performance. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing a judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
