Process Reward Models That Think

Abstract

Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models and, using only 1% of the process labels in PRM800K, outperforms LLM-as-a-Judge and discriminative verifiers across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
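
To make the best-of-N setting mentioned in the abstract concrete, the following is a minimal sketch of how a step-level verifier might be used to pick one of N candidate solutions. It is not the ThinkPRM implementation: the names `StepScorer`, `solution_score`, `best_of_n`, and `toy_scorer` are hypothetical, the toy scorer stands in for a real verifier, and aggregating step scores with the minimum is an assumption made only for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a step scorer maps (problem, solution steps) to one
# correctness score in [0, 1] per step. A real PRM would sit behind this.
StepScorer = Callable[[str, Sequence[str]], List[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one solution score.

    Assumption for this sketch: a solution is only as good as its weakest
    step, so the minimum is used. Other aggregations are possible.
    """
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    problem: str,
    candidates: Sequence[Sequence[str]],
    score_steps: StepScorer,
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scores = [solution_score(score_steps(problem, steps)) for steps in candidates]
    # On ties, max() keeps the first candidate with the top score.
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_scorer(problem: str, steps: Sequence[str]) -> List[float]:
    """Stand-in verifier: gives a low score to one known-bad arithmetic step."""
    return [0.1 if "2 + 2 = 5" in step else 0.9 for step in steps]


if __name__ == "__main__":
    candidates = [
        ["2 + 2 = 5", "so the answer is 5"],
        ["2 + 2 = 4", "so the answer is 4"],
    ]
    print(best_of_n("What is 2 + 2?", candidates, toy_scorer))
```

In this sketch the verifier is passed in as a plain callable, so the same selection loop would work whether the step scores come from a discriminative classifier or are parsed from a generated verification chain-of-thought.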
