OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
December 11, 2025
Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement also depends on the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human labeling. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale behind the summarized outcomes of long CoTs, achieving both accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at a lower annotation cost. Specifically, in each iteration, the cases on which the current best OPV is most uncertain are annotated by experts and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
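
As a rough illustration of the iterative active-learning framework described above, the following Python sketch spells out one way the loop could be organized. It is a minimal sketch, not the authors' implementation: the callable arguments (score_uncertainty, expert_annotate, rejection_finetune, train_rlvr) and the Case/Label/Verifier type aliases are hypothetical placeholders standing in for the paper's actual components.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for readability; the paper's actual data structures
# and model classes are not specified in the abstract.
Case = dict          # a long CoT plus its summarized outcome and rationale
Label = bool         # expert verdict on whether the rationale/outcome is correct
Verifier = object    # the OPV model


def train_opv_iteratively(
    opv: Verifier,
    pool: List[Case],
    num_rounds: int,
    budget_per_round: int,
    score_uncertainty: Callable[[Verifier, Case], float],
    expert_annotate: Callable[[Sequence[Case]], List[Tuple[Case, Label]]],
    rejection_finetune: Callable[[Verifier, List[Tuple[Case, Label]]], Verifier],
    train_rlvr: Callable[[Verifier, List[Tuple[Case, Label]]], Verifier],
) -> Verifier:
    """Progressively improve an OPV by labeling only its most uncertain cases."""
    labeled: List[Tuple[Case, Label]] = []
    for _ in range(num_rounds):
        # 1. Rank the unlabeled pool by the current best OPV's uncertainty and
        #    take the top-k cases allowed by this round's annotation budget.
        pool.sort(key=lambda c: score_uncertainty(opv, c), reverse=True)
        selected, pool = pool[:budget_per_round], pool[budget_per_round:]

        # 2. Experts annotate only the selected cases, keeping labeling cost low.
        labeled.extend(expert_annotate(selected))

        # 3. Train the next-generation OPV: Rejection Fine-Tuning (RFT) on the
        #    accumulated labels, followed by RL with verifiable rewards (RLVR).
        opv = rejection_finetune(opv, labeled)
        opv = train_rlvr(opv, labeled)
    return opv
```

Under this reading, uncertainty-based selection concentrates the expert budget on exactly the cases the current verifier cannot yet judge, which is what lets verification capability grow each round while the overall annotation cost stays low.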