Process Rewards with Learned Reliability

Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
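To make the Beta-Binomial objective concrete, the following is a minimal sketch of the negative log-likelihood of observing k successful continuations out of n under a Beta(alpha, beta) belief. The function names, the (alpha, beta) parameterization, and the use of the Beta variance as a stand-in for reliability are assumptions made for illustration; the abstract does not specify how BetaPRM parameterizes its output head or defines its reliability signal.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n Monte Carlo continuations
    under a Beta(alpha, beta) belief over the step-success probability.

    Hypothetical helper: the paper's exact loss may differ.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    # log C(n, k)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # log P(k | n, alpha, beta) = log C(n, k) + log B(k + alpha, n - k + beta) - log B(alpha, beta)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of the Beta belief, read here as the step-level success probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of the Beta belief; a smaller value means a more concentrated
    belief. Using it as a reliability proxy is an assumption of this sketch."""
    s = alpha + beta
    return alpha * beta / (s * s * (s + 1))


if __name__ == "__main__":
    # Two beliefs with the same mean (0.75) but different concentration.
    for a, b in [(3.0, 1.0), (30.0, 10.0)]:
        print(a, b, beta_mean(a, b), beta_variance(a, b), beta_binomial_nll(6, 8, a, b))
```

Unlike regression to the ratio k/n, this likelihood depends on n as well as k, so the same empirical success ratio carries more or less evidence depending on how many continuations were sampled. The two beliefs in the usage example share a mean but differ in variance, which is the kind of distinction a single point reward cannot express.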