SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
September 20, 2025
Authors: Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang
cs.AI
Abstract
Process reward models (PRMs) offer fine-grained, step-level evaluations that
facilitate deeper reasoning processes in large language models (LLMs), proving
effective in complex tasks like mathematical reasoning. However, developing
PRMs is challenging due to the high cost and limited scalability of
human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a
promising alternative but suffers from a high noise ratio, which can cause
overfitting and hinder large-scale training. In this work, we conduct a
preliminary study on the noise distribution in synthetic data from MC
estimation, identifying that annotation models tend to both underestimate and
overestimate step correctness due to limitations in their annotation
capabilities. Building on these insights, we propose Self-Denoising Monte Carlo
Annotation (SCAN), an efficient data synthesis and noise-tolerant learning
framework. Our key findings indicate that: (1) Even lightweight models (e.g.,
1.5B parameters) can produce high-quality annotations through a self-denoising
strategy, enabling PRMs to achieve superior performance with only 6% of the
inference cost required by vanilla MC estimation. (2) With our robust learning
strategy, PRMs can effectively learn from this weak supervision, achieving a
39.2-point F1 improvement (from 19.9 to 59.1) on ProcessBench. Despite using
only a compact synthetic dataset, our models surpass strong baselines,
including those trained on large-scale human-annotated datasets such as
PRM800K. Furthermore, performance continues to improve as we scale up the
synthetic data, highlighting the potential of SCAN for scalable,
cost-efficient, and robust PRM training.
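For context, the vanilla Monte Carlo estimation that SCAN improves upon labels each reasoning step by rolling out completions from the step's prefix and scoring the fraction that reach a correct final answer. A minimal sketch of that baseline follows; `complete_fn` and `is_correct_fn` are hypothetical placeholders for the annotation model's completion call and the answer checker, not part of the paper's code.

```python
def mc_step_labels(steps, complete_fn, is_correct_fn, k=8):
    """Estimate per-step correctness labels via Monte Carlo rollouts.

    For each prefix steps[:i], sample k completions with complete_fn
    and label step i by the fraction of rollouts whose final answer
    passes is_correct_fn. High noise in these fractions is the issue
    the abstract describes: a weak annotator can under- or overestimate
    a step's true correctness.
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(is_correct_fn(complete_fn(prefix)) for _ in range(k))
        labels.append(hits / k)
    return labels
```

With k rollouts per step, annotating an n-step solution costs on the order of n * k model calls, which is why reducing rollout cost (the 6% figure above) matters at scale.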