Process Rewards through Learned Reliability

Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
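
To make the Beta-Binomial objective concrete, the following is a minimal sketch of the per-step negative log-likelihood, assuming the model emits two positive parameters `alpha` and `beta` for each step and that `k` of `n` Monte Carlo continuations succeeded. The function names, the `(alpha, beta)` parameterization, and the use of the Beta variance as a stand-in for reliability are assumptions made for illustration; the abstract does not specify how BetaPRM parameterizes its head or defines its reliability signal.

```python
import math


def log_beta(x: float, y: float) -> float:
    """Log of the Beta function B(x, y), computed via log-gamma for stability."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n continuations.

    Uses the Beta-Binomial pmf:
        P(k | n, alpha, beta) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    The (alpha, beta) parameterization is an assumption for this sketch.
    """
    if not (0 <= k <= n) or alpha <= 0 or beta <= 0:
        raise ValueError("need 0 <= k <= n and alpha, beta > 0")
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta): the predicted step success probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of Beta(alpha, beta): one possible (assumed) uncertainty measure."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))


# Hypothetical step: 6 of 8 continuations reached a correct final answer.
loss = beta_binomial_nll(k=6, n=8, alpha=6.0, beta=2.0)
reward = beta_mean(6.0, 2.0)
uncertainty = beta_variance(6.0, 2.0)
```

Under this sketch, two beliefs with the same mean, such as Beta(2, 2) and Beta(20, 20), produce the same point reward but different variances. A point-target regression on the ratio `k / n` cannot express that distinction.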