Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Abstract

Enabling large language models (LLMs) to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and for how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
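To make the two ideas named in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It shows a best-of-N baseline (draw N responses, keep the one a verifier scores highest) and a per-prompt dispatcher that picks a test-time strategy from an estimated difficulty level. Every name here (`sample`, `score`, `estimate_difficulty`, `strategies`) is a hypothetical placeholder for a component the abstract does not specify, and the mapping from difficulty to strategy is assumed to be supplied by the caller.

```python
from typing import Callable, Dict

# All three callable types are assumptions made for illustration only.
Sampler = Callable[[str], str]          # draws one response from the LLM for a prompt
Verifier = Callable[[str, str], float]  # scores a (prompt, response) pair; higher is better
Strategy = Callable[[str, int], str]    # answers a prompt under a given sample budget


def best_of_n(prompt: str, sample: Sampler, score: Verifier, n: int) -> str:
    """Best-of-N baseline: sample n responses and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(prompt) for _ in range(n)]
    # On ties, max() keeps the earliest candidate.
    return max(candidates, key=lambda response: score(prompt, response))


def answer_adaptively(
    prompt: str,
    budget: int,
    estimate_difficulty: Callable[[str], str],
    strategies: Dict[str, Strategy],
) -> str:
    """Per-prompt allocation: choose the strategy registered for this prompt's
    estimated difficulty level, then spend the fixed budget with it.

    How difficulty is estimated and which strategy suits which level are
    assumptions left to the caller; the abstract only states that the best
    choice depends on prompt difficulty.
    """
    level = estimate_difficulty(prompt)
    if level not in strategies:
        raise KeyError(f"no strategy registered for difficulty level {level!r}")
    return strategies[level](prompt, budget)
```

In use, `best_of_n` itself can be registered as one entry of `strategies` (wrapped to fix its sampler and verifier), alongside whatever search or revision procedure is available, so that the same budget is spent differently on easy and hard prompts.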
