Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

May 26, 2025
Authors: Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
cs.AI

Abstract

Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using three times less data. When applied to reward-guided greedy search, our model yields a prm@8 of 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupling error detection from reward estimation not only boosts fine-grained error detection but also substantially improves end-to-end, reward-guided mathematical reasoning, with greater data efficiency.
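
The hierarchical, error-aware design can be illustrated with a minimal sketch: a first stage predicts, per step, the probability of a math error and of a consistency error, and a second stage fuses those signals into a single step reward. The names below (`score_step`, `math_error`, `consistency_error`) and the multiplicative fusion rule are illustrative assumptions, not the paper's learned model.

```python
# Minimal sketch of two-stage, error-aware step scoring. The error
# probabilities are assumed to come from an upstream classifier (stage 1);
# PathFinder-PRM learns both stages, whereas this fusion rule is hand-written.

def score_step(error_probs: dict) -> float:
    """Fuse fine-grained error signals into one step-correctness reward.

    error_probs maps an error type to the probability that the step
    contains it, e.g. {"math_error": 0.1, "consistency_error": 0.3}.
    """
    p_math = error_probs["math_error"]                 # stage-1 signal: math error
    p_consistency = error_probs["consistency_error"]   # stage-1 signal: consistency error

    # Stage 2: a simple "no error of either kind" combination.
    return (1.0 - p_math) * (1.0 - p_consistency)


if __name__ == "__main__":
    # A step that is probably fine mathematically but somewhat inconsistent.
    print(score_step({"math_error": 0.05, "consistency_error": 0.20}))  # 0.76
```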

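The reward-guided greedy search used in the evaluation can likewise be sketched: at each step the generator proposes several candidate next steps, the PRM scores each one, and the highest-scoring candidate is kept. `generate_candidates`, `prm_score`, and the termination convention are hypothetical placeholders for the paper's actual sampler and model.

```python
from typing import Callable, List

# Hedged sketch of reward-guided greedy search with a process reward model.
# `generate_candidates` stands in for an LLM sampler and `prm_score` for a
# trained PRM such as PathFinder-PRM; both are assumptions, not the paper's API.

def reward_guided_greedy_search(
    problem: str,
    generate_candidates: Callable[[str, List[str]], List[str]],
    prm_score: Callable[[str, List[str], str], float],
    max_steps: int = 16,
) -> List[str]:
    """Greedily extend a solution one step at a time, always keeping the
    candidate step the PRM rewards most highly."""
    solution: List[str] = []
    for _ in range(max_steps):
        candidates = generate_candidates(problem, solution)
        if not candidates:
            break
        # Greedy choice: the candidate step with the highest PRM reward.
        best = max(candidates, key=lambda step: prm_score(problem, solution, step))
        solution.append(best)
        if "final answer" in best.lower():  # assumed termination marker
            break
    return solution
```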
