SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Abstract

Process reward models (PRMs) provide fine-grained, step-level evaluations that support deeper reasoning in large language models (LLMs) and have proven effective on complex tasks such as mathematical reasoning. However, developing PRMs is challenging because human-annotated data is costly and scales poorly. Synthetic data from Monte Carlo (MC) estimation is a promising alternative, but its high noise ratio can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study of the noise distribution in synthetic data from MC estimation and find that annotation models tend to both underestimate and overestimate step correctness because of limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings are: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% of the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can learn effectively from this weak supervision, achieving a 39.2-point F1 improvement (from 19.9 to 59.1) on ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
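As a rough illustration of the vanilla MC estimation mentioned above, the following minimal Python sketch scores each step prefix by the fraction of sampled completions that reach a correct final answer. It is not the SCAN procedure, which the abstract does not specify in detail. The function names, the `toy_completer` stand-in for an LLM rollout plus answer check, the rollout count, and the zero-score threshold are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional


def mc_step_scores(
    steps: List[str],
    rollout_is_correct: Callable[[List[str], random.Random], bool],
    num_rollouts: int = 8,
    seed: int = 0,
) -> List[float]:
    """For each step prefix, return the fraction of sampled completions
    that reach the correct final answer (a generic MC estimate)."""
    rng = random.Random(seed)
    scores = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        hits = sum(rollout_is_correct(prefix, rng) for _ in range(num_rollouts))
        scores.append(hits / num_rollouts)
    return scores


def first_error_step(scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Return the 0-based index of the first step whose score is at or
    below the threshold, or None if no step falls that low."""
    for i, score in enumerate(scores):
        if score <= threshold:
            return i
    return None


def toy_completer(prefix: List[str], rng: random.Random) -> bool:
    # Hypothetical stand-in for "sample a completion with the annotation
    # model and check its final answer". A prefix containing a flawed step
    # never recovers; a clean prefix succeeds about 75% of the time.
    if any("WRONG" in step for step in prefix):
        return False
    return rng.random() < 0.75


if __name__ == "__main__":
    solution = ["step 1", "step 2", "WRONG step 3", "step 4"]
    scores = mc_step_scores(solution, toy_completer)
    print(scores, first_error_step(scores))
```

With few rollouts or a weak completer, a correct prefix can receive a score of 0 purely by chance, and a flawed prefix can receive a positive score if a completion happens to recover. These are the underestimation and overestimation errors that the abstract attributes to the limits of the annotation model.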
