Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning

Abstract

Large language models have demonstrated remarkable capabilities in complex mathematical reasoning tasks, but they inevitably make errors over the course of multi-step solutions. Process-level Reward Models (PRMs) have shown great promise by providing supervision and evaluation at each intermediate step, thereby improving the models' reasoning abilities. However, training effective PRMs requires high-quality process reward data, and existing methods for constructing such data are often labour-intensive or inefficient. In this paper, we propose an uncertainty-driven framework for automated process reward data construction that covers both the data generation and the annotation processes for PRMs. Additionally, we identify the limitations of both majority vote and PRMs, and introduce two generic uncertainty-aware output aggregation methods, Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote with those of PRMs. Extensive experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework, and demonstrate that the two output aggregation methods further improve mathematical reasoning abilities across diverse PRMs. The code and data will be publicly available at https://github.com/Jiuzhouh/UnPRM.
