GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Abstract

Recent advances in Large Language Models (LLMs) have shown that using Process Reward Models (PRMs) as verifiers is a promising way to improve LLM performance. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing a judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
