

First Finish Search: Efficient Test-Time Scaling in Large Language Models

May 23, 2025
Authors: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
cs.AI

Abstract

Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
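The core idea, as stated in the abstract, is simply to launch n independent samples in parallel and return whichever finishes first. A minimal sketch of that idea is shown below, assuming an asynchronous setup; `sample_trace` is a hypothetical stand-in (not from the paper) for a call to a reasoning model's decoder, which in practice would stream tokens from an inference engine.

```python
import asyncio
import random

# Hypothetical stand-in for a single decoding call to a reasoning model.
# In practice this would stream tokens from an inference server; here it
# just sleeps for a random "decoding time" and returns a dummy trace.
async def sample_trace(prompt: str, sample_id: int) -> str:
    decode_time = random.uniform(0.5, 3.0)  # simulated trace length / latency
    await asyncio.sleep(decode_time)
    return f"[sample {sample_id}] answer after {decode_time:.2f}s"

async def first_finish_search(prompt: str, n: int = 4) -> str:
    """Launch n independent samples and return the first one that completes.

    This mirrors the FFS idea from the abstract: shorter traces are more
    likely to be correct, so stopping at the first finished trace both cuts
    latency and biases the selection toward shorter traces.
    """
    tasks = [asyncio.create_task(sample_trace(prompt, i)) for i in range(n)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:  # stop the slower decoders
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # drain cancellations
    return next(iter(done)).result()

if __name__ == "__main__":
    print(asyncio.run(first_finish_search("Solve: 2 + 2 = ?", n=4)))
```

Running the sketch prints whichever simulated sample finishes first; with a real model, n trades extra parallel compute for lower latency and, per the paper's observation, a preference for shorter (and more often correct) traces.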
