Handling Rewards with Learned Reliability

Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
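
To make the Beta-Binomial objective concrete, the following is a minimal sketch of the per-step loss, not the authors' implementation. It assumes the model emits two positive parameters, here called `alpha` and `beta` (hypothetical names), for each step, and that supervision is the count `k` of successful continuations out of `n` Monte Carlo rollouts. Using the belief's mean as the step reward and its concentration `alpha + beta` as a reliability proxy is likewise an assumption; the abstract does not specify how reliability is read off the belief.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n continuations
    under a Beta(alpha, beta) belief over the step's success probability.

    alpha and beta are assumed to be the (positive) outputs of the PRM head.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)
    return -log_lik


def belief_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return (mean, concentration) of the Beta belief.

    Assumption: the mean serves as the step reward, and a larger
    concentration is read as a more reliable prediction.
    """
    return alpha / (alpha + beta), alpha + beta


# Example: 3 of 8 continuations succeeded, scored under two beliefs
# that share the same mean (0.375) but differ in concentration.
loss_diffuse = beta_binomial_nll(3, 8, alpha=0.75, beta=1.25)
loss_sharp = beta_binomial_nll(3, 8, alpha=7.5, beta=12.5)
```

Unlike regressing onto the ratio `k / n`, this loss depends on `n` as well as `k`, so the same observed ratio from few rollouts constrains the belief less than it would from many rollouts.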