비지도 과정 보상 모델

초록

프로세스 보상 모델(PRM)은 세분화된 단계별 감독을 제공하여 대규모 언어 모델의 추론을 유도하는 강력한 메커니즘이다. 그러나 이러한 효과성은 상당한 비용을 수반한다. 즉, PRM은 모든 추론 단계에 대해 전문가의 주석을 필요로 하므로 비용이 많이 들고 확장이 어렵다. 본 연구에서는 단계별 주석 수준이나 최종 답변의 정답(ground-truth) 검증 모두에서 인간의 감독을 필요로 하지 않는 비지도 PRM(uPRM) 훈련 방법을 제안한다. 이 접근법의 핵심 아이디어는 LLM의 다음 토큰 확률로부터 유도된 점수 함수를 정의하여, 일괄 처리된 추론 경로들에서 첫 번째 오류 단계의 후보 위치를 공동으로 평가하는 것이다. 우리는 다양한 시나리오에서 uPRM의 효과성을 입증한다: (i) uPRM은 ProcessBench 데이터셋에서 첫 번째 오류 단계 식별에 있어 LLM-판사(LLM-as-a-Judge) 대비 최대 15%의 절대적 정확도 향상을 달성한다; (ii) 테스트 시간 확장을 위한 검증기로서 uPRM은 지도 PRM과 유사한 성능을 보이며, 다수결 투표 기준선 대비 최대 6.9% 향상된 성능을 나타낸다; (iii) 강화 학습에서 보상 신호로 사용될 때, uPRM은 정답 레이블을 사용하여 훈련된 지도 PRM에 비해 훈련 전반에 걸쳐 더 강건한 정책 최적화를 가능하게 한다. 전반적으로, 우리의 결과는 복잡한 추론 작업을 위한 확장 가능한 보상 모델링의 길을 열어준다.

English

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.