Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning
August 3, 2025
Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
cs.AI
Abstract
Large language models have demonstrated remarkable capabilities in complex
mathematical reasoning tasks, but they inevitably generate errors throughout
multi-step solutions. Process-level Reward Models (PRMs) have shown great
promise by providing supervision and evaluation at each intermediate step,
thereby effectively improving the models' reasoning abilities. However,
training effective PRMs requires high-quality process reward data, yet existing
methods for constructing such data are often labour-intensive or inefficient.
In this paper, we propose an uncertainty-driven framework for automated process
reward data construction, encompassing both data generation and annotation
processes for PRMs. Additionally, we identify the limitations of both majority
vote and PRMs, and introduce two generic uncertainty-aware output aggregation
methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which
combine the strengths of majority vote with PRMs. Extensive experiments on
ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the
proposed PRM data construction framework, and demonstrate that the two output
aggregation methods further improve the mathematical reasoning abilities across
diverse PRMs. The code and data will be publicly available at
https://github.com/Jiuzhouh/UnPRM.
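The abstract names Hybrid Majority Reward Vote and Weighted Reward Frequency Vote but does not give their formulas. The sketch below is only a plausible illustration of a frequency-weighted-by-reward aggregation over sampled solutions; the function name, the use of a single scalar PRM score per solution, and the exact weighting are assumptions made for this example, not the paper's definition.

```python
from collections import defaultdict

def weighted_reward_frequency_vote(samples):
    """Aggregate sampled solutions into a final answer.

    `samples` is a list of (answer, prm_score) pairs, where prm_score is
    a scalar the PRM assigns to the whole solution (e.g. the minimum or
    product of its per-step rewards). Each candidate answer is scored by
    its vote frequency weighted by the mean reward of the solutions that
    reach it, combining majority vote with PRM judgement.
    """
    freq = defaultdict(int)
    reward_sum = defaultdict(float)
    for answer, score in samples:
        freq[answer] += 1
        reward_sum[answer] += score

    n = len(samples)

    def combined(ans):
        # frequency share * mean PRM reward of solutions producing `ans`
        return (freq[ans] / n) * (reward_sum[ans] / freq[ans])

    return max(freq, key=combined)

# Example: three sampled solutions; "42" wins on frequency despite a
# slightly lower individual reward than the lone "37" solution.
print(weighted_reward_frequency_vote([("42", 0.8), ("42", 0.7), ("37", 0.9)]))
```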