Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Abstract

Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS: its quality and computational cost jointly determine (1) reasoning performance and (2) compute efficiency. In this work, we challenge the conventional verification paradigms and make the first attempt to systematically investigate the impact of verification granularity, that is, how frequently the verifier is invoked during generation, going beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over beam search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
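To make the role of the granularity parameter g concrete, the following is a minimal sketch of a verifier-guided search in which the verifier is invoked only every g generation steps. It is an illustration under stated assumptions, not the authors' implementation: the function `vg_search_sketch`, the parameters `keep` and `branch`, and the toy generator and verifier are all hypothetical names introduced here. The sketch assumes that g = 1 corresponds to step-level (beam-search-like) verification and that g equal to the total number of steps corresponds to verifying only final outputs, as in Best-of-N.

```python
import random
from typing import Callable, List, Tuple

Path = List[int]  # a toy "reasoning path": one integer per generation step


def toy_extend(path: Path, rng: random.Random) -> Path:
    """Hypothetical generator: append one random 'step' to the path."""
    return path + [rng.randint(0, 9)]


def toy_score(path: Path) -> float:
    """Hypothetical verifier: score a (partial) path by the sum of its steps."""
    return float(sum(path))


def vg_search_sketch(
    extend: Callable[[Path, random.Random], Path],
    score: Callable[[Path], float],
    num_steps: int,
    g: int,
    keep: int,
    branch: int,
    seed: int = 0,
) -> Tuple[Path, int]:
    """Toy search that calls the verifier every g steps (assumed semantics).

    Returns the best final path and the number of verifier calls made.
    """
    if min(num_steps, g, keep, branch) < 1:
        raise ValueError("num_steps, g, keep and branch must all be >= 1")

    rng = random.Random(seed)
    # Start with keep * branch empty candidates so the pool size stays fixed.
    candidates: List[Path] = [[] for _ in range(keep * branch)]
    verifier_calls = 0

    for step in range(1, num_steps + 1):
        # Every candidate is extended by one generation step.
        candidates = [extend(p, rng) for p in candidates]

        is_last = step == num_steps
        if step % g == 0 or is_last:
            # Verification point: score every candidate exactly once.
            verifier_calls += len(candidates)
            candidates.sort(key=score, reverse=True)
            if is_last:
                break
            # Keep the top candidates and branch each of them again.
            survivors = candidates[:keep]
            candidates = [list(p) for p in survivors for _ in range(branch)]

    return candidates[0], verifier_calls
```

In this toy setting with 4 candidates and 6 steps, g = 1 triggers 24 verifier calls, g = 3 triggers 8, and g = 6 triggers only 4, which shows how a coarser granularity trades verification frequency for lower verifier cost. Accuracy effects depend on the real generator and verifier and are not modeled here.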
