Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning
August 3, 2025
Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
cs.AI
Abstract
Large language models have demonstrated remarkable capabilities in complex
mathematical reasoning tasks, but they inevitably generate errors throughout
multi-step solutions. Process-level Reward Models (PRMs) have shown great
promise by providing supervision and evaluation at each intermediate step,
thereby effectively improving the models' reasoning abilities. However,
training effective PRMs requires high-quality process reward data, yet existing
methods for constructing such data are often labour-intensive or inefficient.
In this paper, we propose an uncertainty-driven framework for automated process
reward data construction, encompassing both data generation and annotation
processes for PRMs. Additionally, we identify the limitations of both majority
voting and PRMs, and introduce two generic uncertainty-aware output aggregation
methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which
combine the strengths of majority voting and PRMs. Extensive experiments on
ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the
proposed PRM data construction framework, and demonstrate that the two output
aggregation methods further improve mathematical reasoning performance across
diverse PRMs. The code and data will be publicly available at
https://github.com/Jiuzhouh/UnPRM.
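
The abstract does not spell out the data construction framework, but the general idea of uncertainty-driven step annotation can be illustrated. The sketch below assumes a Monte Carlo style estimator: for each prefix of a solution, sample several completions, label the step by the empirical success rate, and treat the binary entropy of the outcomes as an uncertainty signal for deciding which labels to trust. The `sample_completions` stub, the number of rollouts, and the entropy threshold are all hypothetical, not the paper's formulation.

```python
import math
import random  # only used by the toy stub below

def sample_completions(problem, prefix_steps, n=8):
    """Hypothetical stub: roll out n completions of the partial solution
    and return 1/0 per rollout depending on whether the final answer is
    correct. In practice this would call an LLM and an answer checker."""
    return [random.randint(0, 1) for _ in range(n)]

def annotate_step(problem, prefix_steps, n=8, max_entropy=0.8):
    """Label one intermediate step by its Monte Carlo success rate, and
    flag the label as unreliable when the outcome distribution is too
    uncertain (binary entropy above max_entropy)."""
    outcomes = sample_completions(problem, prefix_steps, n)
    p = sum(outcomes) / len(outcomes)  # empirical success rate of the prefix
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    label = 1 if p >= 0.5 else 0       # step judged correct if most rollouts succeed
    reliable = entropy <= max_entropy  # low entropy -> keep the annotation
    return label, entropy, reliable
```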
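
To make the two aggregation strategies concrete, here is a minimal Python sketch. It only assumes that each sampled solution carries a final answer and a PRM score, that Hybrid Majority Reward Vote falls back from majority voting to the PRM when the vote is not decisive, and that Weighted Reward Frequency Vote weights each answer's frequency by its PRM rewards. The decisiveness threshold and the exact weighting are illustrative assumptions; the paper defines the precise formulations.

```python
from collections import defaultdict

def hybrid_majority_reward_vote(samples, threshold=0.5):
    """Pick the majority answer when the vote is decisive; otherwise fall
    back to the answer of the highest PRM-scored sample. `samples` is a
    list of (final_answer, prm_score) pairs; the threshold and fallback
    rule are illustrative assumptions."""
    counts = defaultdict(int)
    for answer, _ in samples:
        counts[answer] += 1
    top_answer, top_count = max(counts.items(), key=lambda kv: kv[1])
    if top_count / len(samples) >= threshold:
        return top_answer                       # confident majority: trust the vote
    return max(samples, key=lambda s: s[1])[0]  # uncertain vote: defer to the PRM

def weighted_reward_frequency_vote(samples):
    """Score each candidate answer by the summed PRM rewards of the samples
    that produced it (frequency weighted by reward) and return the best."""
    scores = defaultdict(float)
    for answer, prm_score in samples:
        scores[answer] += prm_score
    return max(scores.items(), key=lambda kv: kv[1])[0]

# Toy example: five sampled solutions as (final_answer, PRM score).
samples = [("42", 0.9), ("42", 0.4), ("7", 0.95), ("7", 0.9), ("13", 0.2)]
print(hybrid_majority_reward_vote(samples))    # vote is tied -> PRM picks "7"
print(weighted_reward_frequency_vote(samples)) # reward-weighted count -> "7"
```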