

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

December 11, 2025
Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
cs.AI

Abstract

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from automated oversight by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long chains of thought (CoTs), while current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the derivation process of the outcomes summarized from long CoTs, achieving both accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the cases the current best OPV is most uncertain about are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 versus 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3% as the compute budget scales.
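
The abstract only outlines the iterative active-learning loop at a high level. The sketch below is a minimal Python illustration of one possible realization: select the cases the current verifier is least certain about, have experts annotate only those, and train the next-round OPV with RFT followed by RLVR. All class, function, and method names here (Case, uncertainty, rejection_finetune, rlvr_train, etc.) are hypothetical and not taken from the paper's code; the disagreement-based uncertainty score is an assumed heuristic.

```python
# Hypothetical sketch of the iterative active-learning loop described in the
# abstract: annotate the current verifier's most uncertain cases, then train
# the next-round OPV with Rejection Fine-Tuning (RFT) followed by RLVR.
# All interfaces are illustrative, not the authors' released code.

from dataclasses import dataclass


@dataclass
class Case:
    cot_summary: str            # summarized outcome/rationale from a long CoT
    label: bool | None = None   # expert verdict: derivation correct or not


def uncertainty(verifier, case: Case, n_samples: int = 8) -> float:
    """Assumed heuristic: disagreement among sampled verifier verdicts."""
    votes = [verifier.judge(case.cot_summary) for _ in range(n_samples)]
    p = sum(votes) / n_samples
    return 1.0 - abs(2 * p - 1)  # 0 = confident, 1 = maximally uncertain


def active_learning_round(verifier, unlabeled: list[Case], budget: int, experts):
    # 1) pick the cases the current best OPV is least certain about
    ranked = sorted(unlabeled, key=lambda c: uncertainty(verifier, c), reverse=True)
    selected = ranked[:budget]

    # 2) obtain expert annotations only for the selected cases
    for case in selected:
        case.label = experts.annotate(case)

    # 3) train the next-round verifier: RFT on rationales consistent with the
    #    expert labels, then RLVR using the labels as verifiable rewards
    verifier = verifier.rejection_finetune(selected)
    verifier = verifier.rlvr_train(selected)
    return verifier, [c for c in unlabeled if c not in selected]


# Usage (schematic): run several rounds under a fixed annotation budget per round.
# opv, pool = initial_opv, unlabeled_pool
# for _ in range(num_rounds):
#     opv, pool = active_learning_round(opv, pool, budget=512, experts=expert_panel)
```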
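The reported AIME2025 gain comes from pairing OPV with a policy model as the sampling budget grows. A common way such verifier-policy collaboration is implemented is verifier-scored best-of-n selection; the sketch below assumes that scheme (the paper's exact aggregation may differ), with policy.sample and opv.score as hypothetical interfaces.

```python
# Hypothetical sketch of verifier-guided test-time scaling: sample n candidate
# solutions from the policy model, score each candidate's summarized rationale
# with OPV, and return the answer of the highest-scoring candidate.
# Interfaces are illustrative, not from the paper.

def best_of_n(policy, opv, problem: str, n: int = 16) -> str:
    candidates = [policy.sample(problem) for _ in range(n)]          # long-CoT solutions
    scored = [(opv.score(c.summary), c.final_answer) for c in candidates]
    best_score, best_answer = max(scored, key=lambda t: t[0])
    return best_answer

# As n (the compute budget) grows, the verifier filters out candidates whose
# derivations it judges unreliable; this is the kind of scaling behavior behind
# the 55.2% -> 73.3% AIME2025 improvement reported for
# DeepSeek-R1-Distill-Qwen-32B in the abstract.
```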