Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Abstract

Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions), and they erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) serves as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes. We design a scalable pipeline that constructs over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that, with Best-of-N inference, DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep. Notably, with only 4B parameters, DataPRM outperforms strong baselines and generalizes robustly across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, which validates the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
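To make the notion of a silent error concrete, the following minimal Python sketch shows a data analysis step that runs without any interpreter exception yet returns a wrong answer, together with a simple inspection of the intermediate state that exposes the problem. It is an illustration only and is not taken from DataPRM: the names `raw_prices`, `buggy_max`, `fixed_max`, and `probe_types` are hypothetical, and the probe is merely one assumed way a verifier might examine execution state.

```python
# Illustrative sketch of a "silent error"; all names here are hypothetical
# and are not part of DataPRM or its codebase.

# Column values as they might arrive from a CSV reader: every cell is a string.
raw_prices = ["9", "10", "100"]


def buggy_max(values):
    # Silent error: max() on strings compares lexicographically,
    # so the call succeeds but the result is not the numeric maximum.
    return max(values)


def fixed_max(values):
    # Convert to numbers before comparing.
    return max(float(v) for v in values)


def probe_types(values):
    # Assumed form of a probe: report which element types are present
    # in the intermediate state instead of trusting the final answer.
    return sorted({type(v).__name__ for v in values})


if __name__ == "__main__":
    print(buggy_max(raw_prices))    # prints 9, with no exception raised
    print(fixed_max(raw_prices))    # prints 100.0
    print(probe_types(raw_prices))  # prints ['str'], revealing the cause
```

A check that looks only for raised exceptions would accept the buggy step, whereas inspecting the intermediate values shows that the column was never converted to a numeric type.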