Process Reward Models That Think
April 23, 2025
Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
cs.AI
Abstract
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a
key ingredient for test-time scaling. PRMs require step-level supervision,
making them expensive to train. This work aims to build data-efficient PRMs as
verbalized step-wise reward models that verify every step in the solution by
generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long
CoT verifier fine-tuned on orders of magnitude fewer process labels than those
required by discriminative PRMs. Our approach capitalizes on the inherent
reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and
discriminative verifiers -- using only 1% of the process labels in PRM800K --
across several challenging benchmarks. Specifically, ThinkPRM beats the
baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and
reward-guided search. In an out-of-domain evaluation on a subset of
GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers
trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the
same token budget, ThinkPRM scales up verification compute more effectively
compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of
ProcessBench. Our work highlights the value of generative, long CoT PRMs that
can scale test-time compute for verification while requiring minimal
supervision for training. Our code, data, and models will be released at
https://github.com/mukhal/thinkprm.
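The abstract describes using a verbalized PRM for best-of-N selection: the verifier generates a verification chain-of-thought over each candidate solution, yields per-step scores, and the highest-scoring candidate is kept. The sketch below illustrates that selection loop only in outline; the function names (`best_of_n`, the `verifier` callable) and the min-score aggregation are assumptions for illustration, not the paper's released code or its exact scoring rule.

```python
# Minimal sketch of best-of-N selection with a generative (verbalized) step-wise
# verifier. The `verifier` callable is a hypothetical stand-in for a ThinkPRM-style
# model that returns a verification CoT plus per-step correctness scores.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredSolution:
    steps: List[str]          # candidate solution, split into steps
    verification_cot: str     # the verifier's chain-of-thought critique
    step_scores: List[float]  # per-step correctness scores from the verifier


def aggregate(step_scores: List[float]) -> float:
    """Collapse step scores into one solution score (min is one common choice)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    problem: str,
    candidates: List[List[str]],
    verifier: Callable[[str, List[str]], Tuple[str, List[float]]],
) -> List[str]:
    """Return the candidate whose verification yields the highest aggregate score."""
    scored = []
    for steps in candidates:
        cot, step_scores = verifier(problem, steps)  # one generative verification pass
        scored.append(ScoredSolution(steps, cot, step_scores))
    return max(scored, key=lambda s: aggregate(s.step_scores)).steps
```

Reward-guided search follows the same idea, except the verifier scores partial solutions during decoding rather than only ranking finished candidates.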