

SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

September 20, 2025
Authors: Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang
cs.AI

Abstract

Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings indicate that: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
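As context for the abstract, the vanilla MC estimation it refers to labels each reasoning step by rolling out completions from the solution prefix and scoring the step by the fraction of rollouts that reach a correct final answer. The sketch below illustrates this baseline only (not SCAN's self-denoising itself); the function names, the `toy_sampler`, and the thresholding rule are illustrative assumptions, not the paper's implementation.

```python
def mc_step_labels(steps, sample_completion, n_samples=8, threshold=0.0):
    """Vanilla Monte Carlo step annotation (baseline sketch).

    For each solution prefix steps[:i+1], run n_samples rollouts and
    estimate the step's correctness as the fraction of rollouts that
    reach a correct final answer. Steps whose estimate is <= threshold
    are labeled incorrect (0); others correct (1).
    """
    labels = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        # sample_completion(prefix) -> 1 if the rollout's final answer
        # is correct, else 0 (here a stub stands in for an LLM rollout).
        hits = sum(sample_completion(prefix) for _ in range(n_samples))
        score = hits / n_samples
        labels.append((score, 1 if score > threshold else 0))
    return labels

# Deterministic stub: rollouts fail once the flawed step is in the prefix.
def toy_sampler(prefix):
    return 0 if "step3_wrong" in prefix else 1

steps = ["step1", "step2", "step3_wrong", "step4"]
print(mc_step_labels(steps, toy_sampler))
# → [(1.0, 1), (1.0, 1), (0.0, 0), (0.0, 0)]
```

Because every step requires `n_samples` rollouts from a capable annotation model, this estimator is expensive and noisy, which is the cost/noise trade-off SCAN's self-denoising strategy targets.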
PDF · September 23, 2025