Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

August 6, 2024
Authors: Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
cs.AI

Abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
