形式的検証ツールを用いたステップレベル推論検証器のトレーニング

要旨

プロセス報酬モデル（PRMs）は、大規模言語モデル（LLMs）によって生成された推論に対して段階的なフィードバックを提供するものとして、注目を集めています。しかし、二つの重要な研究課題が残されています。一つは、トレーニングのための正確なステップレベルのエラーレベルを収集するために、通常はコストのかかる人間のアノテーションが必要となることです。もう一つは、既存のPRMsが数学的推論問題に限定されていることです。これらの課題に対応するため、本論文では、自動的なデータセット作成とPRMsの多様な推論タスクへの汎用化という課題に取り組むことを目指しています。この目標を達成するために、我々はFoVerを提案します。FoVerは、形式的検証ツール（例えば、形式的論理のためのZ3や定理証明のためのIsabelleなど）によって自動的にアノテーションされたステップレベルのエラーレベルを用いてPRMsをトレーニングするアプローチです。これらのツールは、記号的なタスクに対して自動的かつ正確な検証を提供します。このアプローチを使用して、我々は、人間のアノテーションなしで、形式的論理や定理証明タスクに対するLLMの応答にエラーレベルを付与したトレーニングデータセットを合成します。このデータ合成は、形式的検証と互換性のあるタスクにのみ可能ですが、我々のデータセットでトレーニングされたLLMベースのPRMsが、タスク間の汎用性を示し、多様な推論タスクにわたる検証を改善することが観察されています。具体的には、FoVerでトレーニングされたPRMsは、元のLLMsに基づくベースラインPRMsを大幅に上回り、人間やより強力なモデルによってアノテーションされたラベルでトレーニングされた最先端のPRMsと比較しても、競争力のあるまたは優れた結果を達成します。これは、ProcessBenchでのステップレベルの検証や、MATH、AIME、ANLI、MMLU、BBHを含む12の推論ベンチマークにおけるBest-of-Kパフォーマンスによって測定されています。データセット、モデル、コードはhttps://github.com/psunlpgroup/FoVerで提供されています。

English

Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.

形式的検証ツールを用いたステップレベル推論検証器のトレーニング

Training Step-Level Reasoning Verifiers with Formal Verification Tools

要旨

Support