Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Abstract

Large Language Models (LLMs) are prone to hallucination, especially on multi-hop, reasoning-intensive tasks such as mathematical problem solving. Outcome Reward Models verify only the final answer, whereas Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math errors and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using three times less data. When applied to reward-guided greedy search, our model yields a prm@8 of 48.3, a gain of 1.5 points over the strongest baseline. These results demonstrate that decoupling error detection from reward estimation not only boosts fine-grained error detection but also substantially improves end-to-end, reward-guided mathematical reasoning with greater data efficiency.
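The two-stage idea described above (classify error types per step, then derive a step reward that guides greedy search) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `StepJudgement`, `step_reward`, `greedy_search`, the `propose` and `judge` callables, and the `<end>` stop marker are illustrative stand-ins. In particular, multiplying the two probabilities is only an assumed way to combine the signals, and reading prm@8 as "8 candidates sampled per step" is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepJudgement:
    """Hypothetical per-step output of an error-aware PRM."""
    p_math_ok: float         # probability the step contains no math error
    p_consistency_ok: float  # probability the step is consistent with the context


def step_reward(j: StepJudgement) -> float:
    # ASSUMPTION: combine the two error signals by a simple product.
    # The abstract only says the signals are combined; it does not say how.
    return j.p_math_ok * j.p_consistency_ok


def greedy_search(
    question: str,
    propose: Callable[[str, List[str], int], List[str]],
    judge: Callable[[str, List[str], str], StepJudgement],
    n_candidates: int = 8,   # ASSUMPTION: the "8" in prm@8
    max_steps: int = 10,
    stop_marker: str = "<end>",
) -> List[str]:
    """Build a solution step by step, keeping the highest-reward candidate."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = propose(question, steps, n_candidates)
        if not candidates:
            break
        # Score every candidate next step and keep the best one.
        best = max(
            candidates,
            key=lambda c: step_reward(judge(question, steps, c)),
        )
        steps.append(best)
        if best.endswith(stop_marker):
            break
    return steps
```

In practice, `propose` would sample candidate next steps from a generator LLM and `judge` would wrap the reward model; both are left as plain callables here so the search loop stays independent of any particular model API.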

