GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Abstract

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine step-level validation with global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
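The abstract mentions reward-guided greedy search without spelling out the procedure. The following is a minimal, generic sketch of that idea, not the authors' implementation: at each step a generator proposes candidate next steps, a step-level scorer rates each one, and the highest-scoring candidate is kept. All names here (`reward_guided_greedy_search`, `propose_steps`, `score_step`, `is_terminal`, and the toy stand-ins) are hypothetical. The toy scorer simply checks arithmetic and stands in for a PRM; it is not GroundedPRM.

```python
from typing import Callable, List, Sequence


def reward_guided_greedy_search(
    question: str,
    propose_steps: Callable[[str, List[str]], Sequence[str]],
    score_step: Callable[[str, List[str], str], float],
    is_terminal: Callable[[str], bool],
    max_steps: int = 8,
) -> List[str]:
    """Build a solution step by step, always keeping the best-scored candidate."""
    chosen: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(question, chosen)
        if not candidates:
            break
        # Greedy choice: keep only the candidate the scorer rates highest.
        best = max(candidates, key=lambda step: score_step(question, chosen, step))
        chosen.append(best)
        if is_terminal(best):
            break
    return chosen


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_propose(question: str, prefix: List[str]) -> Sequence[str]:
    """Hypothetical generator: offers one correct and one incorrect step."""
    if not prefix:
        return ["2 + 3 = 6", "2 + 3 = 5"]
    return ["5 * 4 = 25. Final answer: 25", "5 * 4 = 20. Final answer: 20"]


def toy_score(question: str, prefix: List[str], step: str) -> float:
    """Hypothetical step scorer: 1.0 if the stated arithmetic holds, else 0.0."""
    equation = step.split(".")[0]            # e.g. "5 * 4 = 20"
    left, claimed = equation.split("=")
    a, op, b = left.split()
    value = int(a) + int(b) if op == "+" else int(a) * int(b)
    return 1.0 if value == int(claimed) else 0.0


solution = reward_guided_greedy_search(
    question="Compute (2 + 3) * 4.",
    propose_steps=toy_propose,
    score_step=toy_score,
    is_terminal=lambda step: "Final answer" in step,
)
print(solution)
```

In practice, `propose_steps` would sample continuations from an LLM and `score_step` would query a trained PRM. The loop itself does not depend on how those two components are built.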
