Budget-aware Test-time Scaling via Discriminative Verification
October 16, 2025
Authors: Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang, Raluca Ada Popa, Chenguang Wang
cs.AI
Abstract
Test-time scaling is a powerful strategy for boosting the performance of
large language models on complex reasoning tasks. While state-of-the-art
approaches often employ generative verifiers to select the best solution from a
pool of candidates, this method incurs prohibitive computational costs,
limiting its practicality. In this work, we shift the focus to a more
budget-aware paradigm: discriminative verification. We conduct a thorough
empirical analysis and demonstrate that while discriminative verifiers may
underperform in isolation, combining them with self-consistency in a hybrid
approach creates a powerful and efficient test-time scaling mechanism. Notably,
under a fixed compute budget, this hybrid approach surpasses state-of-the-art
generative verification by a significant margin: achieving up to 15.3% higher
accuracy on AIME2025. Our findings establish that for practical, real-world
applications, budget-aware scaling with discriminative verifiers is not only a
"free" upgrade over self-consistency, but also a more effective and efficient
alternative to costly generative techniques. Code is available at
https://github.com/wang-research-lab/verification.
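The hybrid approach described above — combining a discriminative verifier with self-consistency — can be understood as verifier-weighted majority voting: instead of each sampled solution casting one vote for its final answer, its vote is weighted by the verifier's score. The sketch below illustrates this idea; the function name `hybrid_select` and the particular weighting scheme are illustrative assumptions, not the paper's exact method.

```python
from collections import defaultdict

def hybrid_select(answers, scores):
    """Pick a final answer by verifier-weighted majority voting.

    answers: final answers extracted from independently sampled solutions.
    scores:  discriminative-verifier scores in [0, 1], one per solution.
    (Illustrative sketch; the paper's exact aggregation may differ.)
    """
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score  # each vote is weighted by verifier confidence
    return max(totals, key=totals.get)

# Plain majority voting would pick "42" (3 votes vs. 2), but the
# verifier's confidence in the minority answers flips the outcome.
answers = ["42", "42", "42", "17", "17"]
scores = [0.2, 0.3, 0.1, 0.9, 0.8]
print(hybrid_select(answers, scores))  # "17" (weight 1.7 vs. 0.6)
```

With uniform scores this reduces exactly to self-consistency, which is one way to see why the hybrid is a "free" upgrade: the verifier can only reweight votes, and when it is uninformative the method falls back to plain majority voting.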