

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

October 16, 2025
Authors: Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, Volker Tresp
cs.AI

Abstract

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
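The abstract describes the hybrid reward aggregation only at a high level. As a rough illustration of the idea, the sketch below fuses a binary tool-verification signal with an MCTS-derived value estimate into a per-step reward and a binary step label for PRM training data. The weighted-sum rule, the threshold, and all identifiers (StepFeedback, hybrid_step_reward, alpha) are assumptions for illustration only, not the paper's actual formulation.

```python
# Minimal sketch (assumed, not the paper's implementation) of fusing an
# execution-grounded tool check with MCTS-derived feedback into a step reward.
from dataclasses import dataclass


@dataclass
class StepFeedback:
    tool_verified: bool  # did an external tool confirm this intermediate step?
    mcts_value: float    # value estimate backed up from MCTS rollouts, in [0, 1]


def hybrid_step_reward(fb: StepFeedback, alpha: float = 0.5) -> float:
    """Weighted fusion of tool verification and rollout feedback.

    alpha is an assumed weighting knob; the paper's exact aggregation
    mechanism may differ.
    """
    tool_signal = 1.0 if fb.tool_verified else 0.0
    return alpha * tool_signal + (1.0 - alpha) * fb.mcts_value


def label_reasoning_path(path: list[StepFeedback], threshold: float = 0.5) -> list[int]:
    """Turn per-step rewards into binary step labels for automatic supervision."""
    return [1 if hybrid_step_reward(fb) >= threshold else 0 for fb in path]


if __name__ == "__main__":
    # Toy three-step path: the second step fails tool verification and has a
    # low backed-up value, so it is labeled as an erroneous step.
    path = [
        StepFeedback(tool_verified=True, mcts_value=0.8),
        StepFeedback(tool_verified=False, mcts_value=0.2),
        StepFeedback(tool_verified=True, mcts_value=0.6),
    ]
    print(label_reasoning_path(path))  # -> [1, 0, 1]
```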