Unsupervised Process Reward Models

Abstract

Process Reward Models (PRMs) are a powerful mechanism for steering large language model (LLM) reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, which makes them expensive to train and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRMs) that requires no human supervision, neither in the form of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of the first erroneous step across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM in three settings: (i) in identifying the first erroneous step on the ProcessBench dataset, uPRM achieves up to a 15% absolute accuracy improvement over an LLM-as-a-Judge baseline; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority-voting baseline by up to 6.9%; and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training than a supervised PRM trained on ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
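To make the verifier setting in (ii) concrete, the following is a minimal sketch of how step-level scores from any PRM could be used to pick one answer among several sampled trajectories, next to a majority-voting baseline. It is not the uPRM scoring function, which the abstract does not specify. The step scores are hypothetical values, and the min-over-steps aggregation, the 0.5 threshold, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Baseline: return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def trajectory_score(step_scores: List[float]) -> float:
    """Aggregate step scores into one trajectory score.

    Assumption: a trajectory is only as reliable as its weakest step,
    so the minimum is used. Other aggregations (product, last step) exist.
    """
    return min(step_scores)


def best_of_n(answers: List[str], all_step_scores: List[List[float]]) -> str:
    """Verifier-guided selection: return the answer of the top-scored trajectory."""
    scores = [trajectory_score(s) for s in all_step_scores]
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def first_error_step(step_scores: List[float], threshold: float = 0.5) -> int:
    """Index of the first step scored below the threshold, or -1 if none is."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return -1


# Hypothetical data: three sampled solutions to the same problem.
answers = ["12", "12", "15"]
all_step_scores = [
    [0.9, 0.2, 0.8],    # weak second step
    [0.8, 0.3, 0.4],    # weak second and third steps
    [0.9, 0.95, 0.85],  # no weak step
]

voted = majority_vote(answers)
selected = best_of_n(answers, all_step_scores)
first_errors = [first_error_step(s) for s in all_step_scores]
```

In this toy case majority voting returns "12", while the verifier prefers "15" because it is the only trajectory without a low-scoring step. This is the kind of disagreement in which a step-level verifier can help, provided its scores are reliable.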