Training Step-Level Reasoning Verifiers with Formal Verification Tools
May 21, 2025
Authors: Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
cs.AI
Abstract
Process Reward Models (PRMs), which provide step-by-step feedback on the
reasoning generated by Large Language Models (LLMs), are receiving increasing
attention. However, two key research gaps remain: collecting accurate
step-level error labels for training typically requires costly human
annotation, and existing PRMs are limited to math reasoning problems. In
response to these gaps, this paper aims to address the challenges of automatic
dataset creation and the generalization of PRMs to diverse reasoning tasks. To
achieve this goal, we propose FoVer, an approach for training PRMs on
step-level error labels automatically annotated by formal verification tools,
such as Z3 for formal logic and Isabelle for theorem proving, which provide
automatic and accurate verification for symbolic tasks. Using this approach, we
synthesize a training dataset with error labels on LLM responses for formal
logic and theorem proving tasks without human annotation. Although this data
synthesis is feasible only for tasks compatible with formal verification, we
observe that LLM-based PRMs trained on our dataset exhibit cross-task
generalization, improving verification across diverse reasoning tasks.
Specifically, PRMs trained with FoVer significantly outperform baseline PRMs
based on the original LLMs and achieve competitive or superior results compared
to state-of-the-art PRMs trained on labels annotated by humans or stronger
models, as measured by step-level verification on ProcessBench and Best-of-K
performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU,
and BBH. The datasets, models, and code are provided at
https://github.com/psunlpgroup/FoVer.
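
To make the core idea concrete, below is a minimal sketch of step-level verification with Z3, using the z3-solver Python bindings. This is only an illustration of the underlying principle, not the paper's actual annotation pipeline: a reasoning step is labeled correct if its conclusion is entailed by its premises, which Z3 checks by testing whether the premises together with the negated conclusion are unsatisfiable. The propositional step shown here is hypothetical.

```python
# Minimal sketch: labeling a single reasoning step with Z3 (z3-solver package).
# Idea: a step is valid iff its premises entail its conclusion, i.e.
# (premises AND NOT conclusion) is unsatisfiable.
from z3 import And, Bools, Implies, Not, Solver, unsat

p, q, r = Bools("p q r")

# Hypothetical step: from "p -> q" and "q -> r", the model concludes "p -> r".
premises = [Implies(p, q), Implies(q, r)]
conclusion = Implies(p, r)

solver = Solver()
solver.add(And(*premises))   # assume all premises hold
solver.add(Not(conclusion))  # search for a counterexample to the conclusion

# unsat means no counterexample exists, so the step is logically valid.
label = "correct" if solver.check() == unsat else "incorrect"
print(label)  # -> correct
```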