GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

April 1, 2025

Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou

cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have shown that Process Reward Models (PRMs) are promising verifiers for improving LLM performance. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing a judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
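
To make the idea of scaling a generative verifier's test-time compute concrete, the sketch below shows one common way to do it: sample several independent judgments for the same reasoning step and average them into a step reward. This is an illustrative assumption, not the aggregation rule confirmed by the abstract. The names `aggregate_step_reward`, `score_step`, `first_error_index`, and the `judge` callable (a stand-in for a model call that reasons about a step and returns a correct/incorrect verdict) are all hypothetical.

```python
from typing import Callable, List


def aggregate_step_reward(judgments: List[bool]) -> float:
    """Return the fraction of sampled judgments that marked the step correct."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def score_step(
    judge: Callable[[str, str], bool],  # hypothetical: (problem, step) -> verdict
    problem: str,
    step: str,
    n_samples: int = 8,
) -> float:
    """Sample the judge n_samples times and average the verdicts.

    A larger n_samples spends more test-time compute on the same step.
    """
    judgments = [judge(problem, step) for _ in range(n_samples)]
    return aggregate_step_reward(judgments)


def first_error_index(step_rewards: List[float], threshold: float = 0.5) -> int:
    """Return the index of the first step scored below threshold, or -1 if none.

    The 0.5 threshold is an assumption chosen for illustration.
    """
    for i, reward in enumerate(step_rewards):
        if reward < threshold:
            return i
    return -1


if __name__ == "__main__":
    # Toy judge (assumption): flags any step containing the text "2 + 2 = 5".
    def toy_judge(problem: str, step: str) -> bool:
        return "2 + 2 = 5" not in step

    steps = ["Let x = 2.", "Then 2 + 2 = 5.", "So the answer is 5."]
    rewards = [score_step(toy_judge, "Compute x + 2.", s, n_samples=4) for s in steps]
    print(rewards, first_error_index(rewards))
```

With a real generative judge, each call would produce its own reasoning and verdict, so the averaged reward becomes less noisy as `n_samples` grows. The toy judge here is deterministic and only exercises the aggregation logic.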