Process Reward Models That Think

April 23, 2025
作者: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
cs.AI

Abstract
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
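The abstract mentions best-of-N selection, where a verifier's step-wise scores are used to pick one of N candidate solutions. As a minimal sketch of that selection loop, the snippet below assumes a hypothetical `score_steps` function standing in for a PRM that returns one correctness score per solution step (ThinkPRM would produce such judgments by generating a verification chain-of-thought); the aggregation rule shown is one common choice, not necessarily the paper's.

```python
# Hedged sketch of best-of-N selection with a step-wise verifier (PRM).
# `score_steps` is a hypothetical stand-in for a PRM: it maps a list of
# solution steps to one correctness score per step.
from typing import Callable, List


def best_of_n(solutions: List[List[str]],
              score_steps: Callable[[List[str]], List[float]]) -> int:
    """Return the index of the candidate with the highest aggregated step score.

    Aggregation here is the minimum step score ("a chain is only as strong as
    its weakest step"); taking the product or the final step's score are other
    common conventions.
    """
    best_idx, best_score = -1, float("-inf")
    for i, steps in enumerate(solutions):
        step_scores = score_steps(steps)
        agg = min(step_scores) if step_scores else float("-inf")
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx
```

With a toy scorer that rates each step independently, the candidate whose weakest step is strongest is selected.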

