Training Step-Level Reasoning Verifiers with Formal Verification Tools
May 21, 2025
Authors: Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
cs.AI
Abstract
Process Reward Models (PRMs), which provide step-by-step feedback on the
reasoning generated by Large Language Models (LLMs), are receiving increasing
attention. However, two key research gaps remain: collecting accurate
step-level error labels for training typically requires costly human
annotation, and existing PRMs are limited to math reasoning problems. In
response to these gaps, this paper aims to address the challenges of automatic
dataset creation and the generalization of PRMs to diverse reasoning tasks. To
achieve this goal, we propose FoVer, an approach for training PRMs on
step-level error labels automatically annotated by formal verification tools,
such as Z3 for formal logic and Isabelle for theorem proving, which provide
automatic and accurate verification for symbolic tasks. Using this approach, we
synthesize a training dataset with error labels on LLM responses for formal
logic and theorem proving tasks without human annotation. Although this data
synthesis is feasible only for tasks compatible with formal verification, we
observe that LLM-based PRMs trained on our dataset exhibit cross-task
generalization, improving verification across diverse reasoning tasks.
Specifically, PRMs trained with FoVer significantly outperform baseline PRMs
based on the original LLMs and achieve competitive or superior results compared
to state-of-the-art PRMs trained on labels annotated by humans or stronger
models, as measured by step-level verification on ProcessBench and Best-of-K
performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU,
and BBH. The datasets, models, and code are provided at
https://github.com/psunlpgroup/FoVer.
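
As a concrete illustration of the kind of automatic step-level checking the abstract describes, below is a minimal sketch (not the paper's actual pipeline) of labeling a single formal-logic reasoning step with Z3's Python API. The propositions and the helper name `verify_step` are illustrative assumptions; the idea is simply that a step is labeled correct when its conclusion is entailed by the established premises.

```python
# Minimal sketch: automatic step-level error labeling with Z3 (z3-solver).
# `verify_step` is a hypothetical helper, not FoVer's actual code.
from z3 import Bools, Implies, And, Not, Solver, unsat

def verify_step(premises, conclusion):
    """Label a reasoning step correct iff the premises entail its conclusion."""
    s = Solver()
    s.add(And(*premises))      # facts established before this step
    s.add(Not(conclusion))     # assume the step's claim is false
    return s.check() == unsat  # no counterexample exists => the step is entailed

# Example: from "p -> q" and "p", concluding "q" is a correct step;
# concluding "r" is not, so that step would receive an error label.
p, q, r = Bools("p q r")
print(verify_step([Implies(p, q), p], q))  # True
print(verify_step([Implies(p, q), p], r))  # False
```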
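The Best-of-K evaluation mentioned in the abstract scores K candidate solutions with the PRM and keeps the highest-scoring one. The sketch below rests on assumptions: a hypothetical `prm_step_scores` function returning one correctness score per step, and aggregation by the minimum step score, which is a common choice but not necessarily the one used by FoVer.

```python
# Minimal sketch of Best-of-K selection with a step-level verifier (PRM).
from typing import Callable, List

def best_of_k(candidates: List[List[str]],
              prm_step_scores: Callable[[List[str]], List[float]]) -> int:
    """Return the index of the candidate whose weakest step scores highest."""
    scores = [min(prm_step_scores(steps)) for steps in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```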