Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier
May 17, 2025
Authors: Jianyuan Zhong, Zeju Li, Zhijian Xu, Xiangyu Wen, Kezhi Li, Qiang Xu
cs.AI
Abstract
Large Language Model (LLM) reasoning for complex tasks inherently involves a
trade-off between solution accuracy and computational efficiency. The
subsequent step of verification, while intended to improve performance, further
complicates this landscape by introducing its own challenging trade-off:
sophisticated Generative Reward Models (GenRMs) can be computationally
prohibitive if naively integrated with LLMs at test-time, while simpler, faster
methods may lack reliability. To overcome these challenges, we introduce
FlexiVe, a novel generative verifier that flexibly balances computational
resources between rapid, reliable fast thinking and meticulous slow thinking
using a Flexible Allocation of Verification Budget strategy. We further propose
the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework
that intelligently integrates FlexiVe, proactively identifying solution
completion points to trigger targeted verification and provide focused solver
feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing
errors within reasoning traces on ProcessBench. Furthermore, on challenging
mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full
approach outperforms baselines like self-consistency in reasoning accuracy and
inference efficiency. Our system offers a scalable and effective solution to
enhance LLM reasoning at test time.
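The Solve-Detect-Verify pipeline described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: every function name, the confidence threshold, and the stub behaviors are hypothetical stand-ins, since the abstract does not specify implementation details.

```python
# Hypothetical sketch of a Solve-Detect-Verify loop with a flexible
# verification budget. All names and thresholds are illustrative.

def solve(problem, feedback=None):
    # Stand-in for the solver LLM; returns a candidate reasoning trace,
    # optionally revised using verifier feedback.
    return f"trace for {problem}" + (f" (revised: {feedback})" if feedback else "")

def detect_completion(trace):
    # Stand-in for the detector that spots solution completion points;
    # here, any non-empty trace counts as complete.
    return bool(trace)

def fast_verify(trace):
    # Cheap "fast-thinking" verification pass: (verdict, confidence).
    return True, 0.6

def slow_verify(trace):
    # Expensive "slow-thinking" pass, invoked only when the fast pass
    # is not confident enough.
    return True

def solve_detect_verify(problem, conf_threshold=0.9, max_rounds=3):
    feedback = None
    trace = ""
    for _ in range(max_rounds):
        trace = solve(problem, feedback)
        if not detect_completion(trace):
            continue  # keep solving until a completion point is detected
        verdict, confidence = fast_verify(trace)
        if confidence < conf_threshold:
            # Flexible budget allocation: escalate to slow thinking
            # only when fast thinking is uncertain.
            verdict = slow_verify(trace)
        if verdict:
            return trace
        feedback = "verifier flagged an error"  # focused solver feedback
    return trace
```

With these stubs, the fast pass is under-confident (0.6 < 0.9), so the slow pass is consulted and accepts the first trace; a real system would replace each stub with LLM calls.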