SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Abstract

Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning in large language models (LLMs), and they have proven effective on complex tasks such as mathematical reasoning. However, developing PRMs is challenging because human-annotated data is costly and scales poorly. Synthetic data from Monte Carlo (MC) estimation is a promising alternative, but it suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study of the noise distribution in synthetic data from MC estimation and find that annotation models tend to both underestimate and overestimate step correctness because of limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings are: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% of the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can learn effectively from this weak supervision, achieving a 39.2-point F1 improvement (from 19.9 to 59.1) on ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
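For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of vanilla MC estimation of step correctness as it is commonly described: from each solution prefix, sample several completions and take the fraction that reach the reference answer. It is not SCAN's self-denoising procedure, whose details the abstract does not give. The names `mc_step_scores`, `first_error_step`, and `make_toy_rollout` are hypothetical, and the toy rollout function is a stand-in for an actual LLM completer.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: given a solution prefix, sample ONE completion and
# report whether it reaches the reference answer. In practice this would wrap
# an LLM call plus an answer checker; here it is only an assumption.
Rollout = Callable[[Sequence[str]], bool]


def mc_step_scores(
    steps: Sequence[str],
    rollout_is_correct: Rollout,
    num_rollouts: int = 8,
) -> List[float]:
    """For each step k, estimate correctness as the fraction of rollouts
    from the prefix steps[: k + 1] that reach the correct final answer."""
    scores = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(1 for _ in range(num_rollouts) if rollout_is_correct(prefix))
        scores.append(hits / num_rollouts)
    return scores


def first_error_step(scores: Sequence[float]) -> Optional[int]:
    """Return the index of the first step whose estimate is exactly zero,
    or None if every prefix produced at least one successful rollout."""
    for i, score in enumerate(scores):
        if score == 0.0:
            return i
    return None


def make_toy_rollout(p_solve: float, seed: int = 0) -> Rollout:
    """Toy completer (assumption, not a real model): it never recovers from a
    step marked WRONG, and otherwise succeeds with probability p_solve."""
    rng = random.Random(seed)

    def rollout(prefix: Sequence[str]) -> bool:
        if any("WRONG" in step for step in prefix):
            return False
        return rng.random() < p_solve

    return rollout


if __name__ == "__main__":
    solution = ["set up equation", "isolate x", "WRONG: sign flipped", "report x"]
    scores = mc_step_scores(solution, make_toy_rollout(p_solve=0.75), num_rollouts=8)
    print(scores, first_error_step(scores))
```

In this toy setup, a completer with a low `p_solve` can fail every rollout from a correct prefix and so mark a correct step as wrong, which is one way the underestimation noise mentioned in the abstract can arise. The toy completer never succeeds after a wrong step, so it does not illustrate the overestimation side.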
