First Finish Search: Efficient Test-Time Scaling in Large Language Models

Abstract

Test-time scaling (TTS), which dynamically allocates compute during inference, offers a promising way to improve reasoning in large language models. Existing TTS methods work well, but they often rely on long decoding paths or require many samples to be generated, which increases token usage and inference latency. We observe that, for reasoning tasks, shorter traces are surprisingly much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one of them completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B, and Phi-4-Reasoning-Plus) and four datasets (AIME24, AIME25-I, AIME25-II, and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching the performance of OpenAI's o4-mini. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
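
The core idea of FFS, launching n independent samples and returning as soon as any one completes, can be illustrated with a minimal sketch. This is not the authors' implementation: `first_finish_search`, `fake_sample`, and `FAKE_TRACES` are hypothetical names, and the model call is replaced by a stand-in whose delay mimics trace length, on the assumption that all samples decode in parallel at the same speed so that the shortest trace finishes first.

```python
import asyncio
from typing import Awaitable, Callable


async def first_finish_search(
    sample_fn: Callable[[int], Awaitable[str]], n: int
) -> str:
    """Launch n independent samples and return whichever finishes first."""
    tasks = [asyncio.create_task(sample_fn(i)) for i in range(n)]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        # If several samples finish at the same time, any of them is accepted.
        return next(iter(done)).result()
    finally:
        # Cancel the remaining samples so they do no further work.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


# Hypothetical stand-in for a reasoning model: (delay in seconds, answer).
# The delay plays the role of trace length; it is not real model output.
FAKE_TRACES = {0: (0.30, "answer A"), 1: (0.05, "answer B"), 2: (0.20, "answer C")}


async def fake_sample(i: int) -> str:
    delay, answer = FAKE_TRACES[i]
    await asyncio.sleep(delay)  # a longer trace finishes later
    return answer


def run_demo() -> str:
    return asyncio.run(first_finish_search(fake_sample, n=3))
```

In this sketch `run_demo()` returns the answer of the sample with the shortest simulated trace. Unlike majority voting, no sample other than the first to finish needs to run to completion, which is where the savings in tokens and latency would come from.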
