Process Rewards with Learned Reliability

Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token trade-off over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
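To make the Beta-Binomial objective concrete, the following is a minimal sketch of the negative log-likelihood of observing k successful continuations out of n under a Beta(alpha, beta) belief. The abstract does not specify the parameterization or the training code, so the function names, the (alpha, beta) parameterization, and the reading of alpha + beta as a reliability proxy are all assumptions made for illustration.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n Monte Carlo continuations
    under a Beta(alpha, beta) belief (hypothetical helper, not the paper's code)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    # log C(n, k)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # log P(k | n, alpha, beta) = log C(n, k) + log B(k + alpha, n - k + beta) - log B(alpha, beta)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def belief_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return (mean, concentration). Treating the mean as the step success
    probability and alpha + beta as a reliability proxy is an assumption."""
    return alpha / (alpha + beta), alpha + beta


# Two beliefs with the same mean (0.75) but different concentration,
# scored against 6 successes out of 8 continuations.
diffuse = beta_binomial_nll(6, 8, alpha=3.0, beta=1.0)
concentrated = beta_binomial_nll(6, 8, alpha=30.0, beta=10.0)
```

Under this sketch, a point-regression target would see only the ratio 6/8, whereas the likelihood also depends on n and on how concentrated the belief is, which is what allows a reliability signal to be learned alongside the success probability.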