
Budget-aware Test-time Scaling via Discriminative Verification

October 16, 2025
Authors: Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang, Raluca Ada Popa, Chenguang Wang
cs.AI

Abstract

Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
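To make the hybrid mechanism described above concrete, here is a minimal sketch of one way to combine self-consistency with a discriminative verifier. The function name `hybrid_select`, the `verifier_score` callable, and the verifier-weighted voting rule are illustrative assumptions; the abstract does not specify the paper's exact combination rule or budget allocation.

```python
from collections import defaultdict

def hybrid_select(solutions, answers, verifier_score):
    """Select a final answer from sampled candidate solutions.

    Illustrative hybrid of self-consistency and discriminative verification
    (an assumption, not necessarily the paper's exact procedure): each
    candidate's vote is weighted by the discriminative verifier's score for
    its solution, and the answer with the largest total weight is returned.
    With uniform scores this reduces to plain majority voting
    (self-consistency).

    solutions:      list of full reasoning/solution strings
    answers:        final answer extracted from each solution (same order)
    verifier_score: callable mapping a solution string to a score in [0, 1]
    """
    total_weight = defaultdict(float)
    for solution, answer in zip(solutions, answers):
        total_weight[answer] += verifier_score(solution)
    # Answer whose supporting candidates accumulated the most verifier weight.
    return max(total_weight, key=total_weight.get)
```

Because the discriminative verifier only scores candidates (a single forward pass each) rather than generating long verification chains, this kind of selection adds little on top of the sampling cost, which is what makes the approach budget-aware.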