Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

May 17, 2025
Authors: Jianyuan Zhong, Zeju Li, Zhijian Xu, Xiangyu Wen, Kezhi Li, Qiang Xu
cs.AI

Abstract

Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.
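The abstract describes the Solve-Detect-Verify control flow at a high level: the solver generates a reasoning trace, a detector watches for solution completion points, and FlexiVe verifies with a cheap "fast thinking" pass that escalates to "slow thinking" only when needed, feeding its verdict back to the solver. The sketch below is a minimal, hypothetical illustration of that loop under assumptions inferred from the abstract; the function names (`generate_step`, `detect_completion`, `fast_verify`, `slow_verify`), the confidence-threshold escalation rule, and the feedback format are placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of a Solve-Detect-Verify style loop, inferred from the
# abstract. Every component below is a stand-in (assumption), not the
# paper's implementation.

from dataclasses import dataclass


@dataclass
class Verdict:
    correct: bool
    confidence: float
    feedback: str = ""


def generate_step(problem: str, trace: list[str]) -> str:
    """Stand-in for one solver decoding step (e.g., one reasoning line)."""
    return f"step {len(trace) + 1} toward solving: {problem}"


def detect_completion(trace: list[str]) -> bool:
    """Stand-in detector: flag a candidate solution completion point."""
    return len(trace) >= 3  # placeholder heuristic


def fast_verify(trace: list[str]) -> Verdict:
    """Cheap 'fast thinking' verification pass (placeholder)."""
    return Verdict(correct=True, confidence=0.6)


def slow_verify(trace: list[str]) -> Verdict:
    """Expensive 'slow thinking' verification pass (placeholder)."""
    return Verdict(correct=True, confidence=0.95)


def solve_detect_verify(problem: str, confidence_threshold: float = 0.8,
                        max_steps: int = 20) -> list[str]:
    """Run the solver, trigger verification only at detected completion
    points, and escalate the verification budget when the fast pass is
    not confident enough; verifier feedback is appended to the trace so
    the solver can revise."""
    trace: list[str] = []
    for _ in range(max_steps):
        trace.append(generate_step(problem, trace))
        if not detect_completion(trace):
            continue  # keep solving until a completion point is detected
        verdict = fast_verify(trace)
        if verdict.confidence < confidence_threshold:
            verdict = slow_verify(trace)  # spend more verification budget
        if verdict.correct:
            return trace
        trace.append(f"verifier feedback: {verdict.feedback}")
    return trace


if __name__ == "__main__":
    print(solve_detect_verify("toy problem"))
```

In this reading, the detection step is what keeps verification cheap: the verifier is invoked only at candidate completion points rather than after every token, and the slow pass is reserved for cases where the fast pass is uncertain.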
