First Finish Search: Efficient Test-Time Scaling in Large Language Models

Abstract

Test-time scaling (TTS), which dynamically allocates compute during inference, offers a promising way to improve reasoning in large language models. Existing TTS methods are effective, but they often rely on long decoding paths or require many samples to be generated, which increases token usage and inference latency. We observe, surprisingly, that for reasoning tasks shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one of them completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B, and Phi-4-Reasoning-Plus) and four datasets (AIME24, AIME25-I, AIME25-II, and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
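
The core control flow of FFS (launch n independent samples, return as soon as any one completes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy sampler, and the use of asyncio are assumptions made for illustration, and wall-clock completion of a coroutine stands in for "the first trace to finish decoding". In a real system, `sample` would wrap a call to a model server.

```python
import asyncio
from typing import Awaitable, Callable


async def first_finish_search(
    sample: Callable[[int], Awaitable[str]], n: int
) -> str:
    """Launch n independent samples and return the first one to complete."""
    tasks = [asyncio.create_task(sample(i)) for i in range(n)]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        # If several tasks finish in the same loop iteration, break the tie
        # by launch order so the result is deterministic.
        winner = min(done, key=tasks.index)
        return winner.result()
    finally:
        # Stop the remaining samples so they consume no further compute.
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


# Hypothetical stand-in for a model call: (trace length in tokens, answer).
TOY_TRACES = [(300, "answer A"), (120, "answer B"), (450, "answer C")]


async def toy_sample(i: int) -> str:
    length, answer = TOY_TRACES[i]
    # Assumption: decoding time grows with trace length.
    await asyncio.sleep(length / 2000)
    return answer


def run_demo() -> str:
    return asyncio.run(first_finish_search(toy_sample, n=len(TOY_TRACES)))
```

In this toy setup the sample with the shortest trace completes first, so `run_demo()` returns its answer and the two longer traces are cancelled before they finish.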
