GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

October 16, 2025
Authors: Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, Volker Tresp
cs.AI

Abstract

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
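
The hybrid reward aggregation described in the abstract, fusing an execution-grounded tool verdict for each step with MCTS-derived feedback, can be illustrated with a minimal sketch. All names below (`StepSignal`, `hybrid_step_reward`, `label_reasoning_path`), the weighting parameter `alpha`, and the labeling threshold are illustrative assumptions for exposition only; they are not the paper's implementation or reported hyperparameters.

```python
# Minimal sketch of a hybrid step-reward aggregation in the spirit of GroundedPRM.
# The fusion weight, threshold, and all function/field names are assumptions,
# not values or APIs taken from the paper.

from dataclasses import dataclass


@dataclass
class StepSignal:
    tool_correct: bool   # execution-grounded verdict from an external tool (e.g., a checker)
    mcts_value: float    # value estimate for this step derived from MCTS rollouts, in [0, 1]


def hybrid_step_reward(signal: StepSignal, alpha: float = 0.5) -> float:
    """Fuse a binary tool-verification signal with MCTS-derived feedback.

    alpha weights local (step-level) correctness against global (outcome-level)
    evidence; 0.5 is a placeholder, not a reported hyperparameter.
    """
    tool_score = 1.0 if signal.tool_correct else 0.0
    return alpha * tool_score + (1.0 - alpha) * signal.mcts_value


def label_reasoning_path(signals: list[StepSignal], threshold: float = 0.5) -> list[int]:
    """Convert fused rewards into per-step labels (1 = correct, 0 = erroneous)."""
    return [1 if hybrid_step_reward(s) >= threshold else 0 for s in signals]


if __name__ == "__main__":
    path = [
        StepSignal(tool_correct=True, mcts_value=0.8),
        StepSignal(tool_correct=False, mcts_value=0.6),  # tool flags an error despite decent rollouts
    ]
    print(label_reasoning_path(path))  # [1, 0] with the default weighting
```

The design intent this sketch tries to capture is that neither signal alone suffices: MCTS rollout values can misattribute credit across steps, while tool checks are local and say nothing about how a step contributes to the final answer, so the fused score combines both before producing the step-level labels used for automatic supervision.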