Process Rewards with Learned Reliability
May 15, 2026
Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang
cs.AI
Abstract
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
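To make the Beta-Binomial objective concrete, the following is a minimal sketch of the negative log-likelihood for one step, assuming the model emits two positive parameters `alpha` and `beta` and that `k` of `n` Monte Carlo continuations succeeded. The function names, the parametrization, and the reading of `alpha + beta` as a reliability proxy are illustrative assumptions; the abstract does not specify the authors' prediction head or their exact reliability definition.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n continuations
    under a Beta(alpha, beta) belief over the step success probability.

    alpha and beta are hypothetical model outputs (assumption).
    """
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    log_lik = (
        log_choose
        + log_beta(k + alpha, n - k + beta)
        - log_beta(alpha, beta)
    )
    return -log_lik


def beta_summary(alpha: float, beta: float) -> tuple[float, float, float]:
    """Return (mean, concentration, variance) of Beta(alpha, beta).

    The mean plays the role of the step reward; treating the
    concentration alpha + beta as "reliability" is an assumption here.
    """
    concentration = alpha + beta
    mean = alpha / concentration
    variance = mean * (1.0 - mean) / (concentration + 1.0)
    return mean, concentration, variance


if __name__ == "__main__":
    # Two beliefs with the same mean (0.75) but different concentration,
    # scored against the same observation: 6 of 8 continuations succeed.
    for a, b in [(3.0, 1.0), (30.0, 10.0)]:
        mean, conc, var = beta_summary(a, b)
        nll = beta_binomial_nll(6, 8, a, b)
        print(f"alpha={a}, beta={b}: mean={mean:.2f}, "
              f"concentration={conc:.0f}, variance={var:.4f}, nll={nll:.4f}")
```

Unlike squared-error regression to the ratio k/n, this loss depends on both the mean and the spread of the belief, so two predictions with the same mean can receive different losses for the same observed counts.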