Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning
August 3, 2025
Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
cs.AI
Abstract
Large language models have demonstrated remarkable capabilities in complex
mathematical reasoning tasks, but they inevitably generate errors throughout
multi-step solutions. Process-level Reward Models (PRMs) have shown great
promise by providing supervision and evaluation at each intermediate step,
thereby effectively improving the models' reasoning abilities. However,
training effective PRMs requires high-quality process reward data, yet existing
methods for constructing such data are often labour-intensive or inefficient.
In this paper, we propose an uncertainty-driven framework for automated process
reward data construction, encompassing both data generation and annotation
processes for PRMs. Additionally, we identify the limitations of both majority
voting and PRMs, and introduce two generic uncertainty-aware output aggregation
methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which
combine the strengths of majority voting and PRMs. Extensive experiments on
ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the
proposed PRM data construction framework, and demonstrate that the two output
aggregation methods further improve mathematical reasoning performance across
diverse PRMs. The code and data will be publicly available at
https://github.com/Jiuzhouh/UnPRM.
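
The abstract does not spell out the data construction framework, but the general idea of uncertainty-driven step annotation can be illustrated. The sketch below assumes a Monte Carlo style estimator: for each prefix of a solution, sample several completions, label the step by the empirical success rate, and treat the binary entropy of the outcomes as an uncertainty signal for deciding which labels to trust. The `sample_completions` stub, the number of rollouts, and the entropy threshold are all hypothetical, not the paper's formulation.

```python
import math
import random  # only used by the toy stub below

def sample_completions(problem, prefix_steps, n=8):
    """Hypothetical stub: roll out n completions of the partial solution
    and return 1/0 per rollout depending on whether the final answer is
    correct. In practice this would call an LLM and an answer checker."""
    return [random.randint(0, 1) for _ in range(n)]

def annotate_step(problem, prefix_steps, n=8, max_entropy=0.8):
    """Label one intermediate step by its Monte Carlo success rate, and
    flag the label as unreliable when the outcome distribution is too
    uncertain (binary entropy above max_entropy)."""
    outcomes = sample_completions(problem, prefix_steps, n)
    p = sum(outcomes) / len(outcomes)  # empirical success rate of the prefix
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    label = 1 if p >= 0.5 else 0       # step judged correct if most rollouts succeed
    reliable = entropy <= max_entropy  # low entropy -> keep the annotation
    return label, entropy, reliable
```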
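
To make the two aggregation strategies concrete, here is a minimal Python sketch. It only assumes that each sampled solution carries a final answer and a PRM score, that Hybrid Majority Reward Vote falls back from majority voting to the PRM when the vote is not decisive, and that Weighted Reward Frequency Vote weights each answer's frequency by its PRM rewards. The decisiveness threshold and the exact weighting are illustrative assumptions; the paper defines the precise formulations.

```python
from collections import defaultdict

def hybrid_majority_reward_vote(samples, threshold=0.5):
    """Pick the majority answer when the vote is decisive; otherwise fall
    back to the answer of the highest PRM-scored sample. `samples` is a
    list of (final_answer, prm_score) pairs; the threshold and fallback
    rule are illustrative assumptions."""
    counts = defaultdict(int)
    for answer, _ in samples:
        counts[answer] += 1
    top_answer, top_count = max(counts.items(), key=lambda kv: kv[1])
    if top_count / len(samples) >= threshold:
        return top_answer                       # confident majority: trust the vote
    return max(samples, key=lambda s: s[1])[0]  # uncertain vote: defer to the PRM

def weighted_reward_frequency_vote(samples):
    """Score each candidate answer by the summed PRM rewards of the samples
    that produced it (frequency weighted by reward) and return the best."""
    scores = defaultdict(float)
    for answer, prm_score in samples:
        scores[answer] += prm_score
    return max(scores.items(), key=lambda kv: kv[1])[0]

# Toy example: five sampled solutions as (final_answer, PRM score).
samples = [("42", 0.9), ("42", 0.4), ("7", 0.95), ("7", 0.9), ("13", 0.2)]
print(hybrid_majority_reward_vote(samples))    # vote is tied -> PRM picks "7"
print(weighted_reward_frequency_vote(samples)) # reward-weighted count -> "7"
```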