First Finish Search: Efficient Test-Time Scaling in Large Language Models

Abstract

Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
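
The decoding rule described above (launch n independent samples and return as soon as any one completes) can be illustrated with a minimal sketch. This is not the authors' implementation: `first_finish_search`, `mock_sample`, and `MOCK_DELAYS` are hypothetical names, and the mock sampler merely sleeps for a fixed time as a stand-in for generating a trace of a given length. A real deployment would replace it with calls to a model server.

```python
import asyncio
from typing import Any, Callable, Coroutine


async def first_finish_search(
    sample: Callable[[int], Coroutine[Any, Any, str]], n: int
) -> str:
    """Run n independent samples concurrently and return the first to finish."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tasks = [asyncio.create_task(sample(i)) for i in range(n)]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        # Several tasks may finish in the same event-loop step;
        # break ties by launch order so the result is deterministic.
        winner = min(done, key=tasks.index)
        return winner.result()
    finally:
        # Stop the remaining samples so they consume no further compute.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


# Hypothetical stand-in for a model call: the sleep time plays the role
# of trace length, so the shortest "trace" finishes first.
MOCK_DELAYS = [0.09, 0.01, 0.05]


async def mock_sample(i: int) -> str:
    await asyncio.sleep(MOCK_DELAYS[i])
    return f"trace-{i}"


if __name__ == "__main__":
    print(asyncio.run(first_finish_search(mock_sample, n=3)))
```

In this sketch the remaining samples are cancelled once a winner is found, which is one plausible way to realize the token and latency savings the abstract refers to; whether and how the original system cancels in-flight generations is an assumption here.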
