Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

August 6, 2024
Authors: Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
cs.AI

Abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pretraining compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods, and current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms for scaling test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt as effectively as possible. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
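
To make the two mechanisms and the routing idea concrete, here is a minimal Python sketch of compute-optimal allocation under stated assumptions. Everything in it is illustrative: the function names (`sample_answer`, `verifier_score`, `estimate_difficulty`), the difficulty threshold, and the rule that easier prompts get sequential revisions while harder ones get wider parallel search are assumptions made for exposition, not the authors' implementation.

```python
import random

# Hypothetical stubs: a real system would call the base LLM here and score
# answers with a dense, process-based verifier (PRM). Randomness just makes
# the sketch runnable end to end.

def sample_answer(prompt: str) -> str:
    """Stand-in for drawing one sample from the base LLM."""
    return f"candidate-{random.randint(0, 10**6)} for: {prompt[:40]}"

def verifier_score(prompt: str, answer: str) -> float:
    """Stand-in for a process-based verifier; returns a scalar in [0, 1]."""
    return random.random()

def estimate_difficulty(prompt: str, probes: int = 4) -> float:
    """Proxy for prompt difficulty: mean verifier rejection over a few probes.
    (The paper bins prompts by the base model's success rate; this is a
    simplified analogue.)"""
    scores = [verifier_score(prompt, sample_answer(prompt)) for _ in range(probes)]
    return 1.0 - sum(scores) / len(scores)

def best_of_n(prompt: str, n: int) -> str:
    """Parallel search: sample n candidates, return the verifier's favorite."""
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(prompt, a))

def sequential_revisions(prompt: str, steps: int) -> str:
    """Sequential refinement: propose revisions one at a time, keeping a
    revision only when the verifier prefers it to the current answer."""
    answer = sample_answer(prompt)
    for _ in range(steps - 1):
        revised = sample_answer(f"{prompt}\n[revise]: {answer}")
        if verifier_score(prompt, revised) > verifier_score(prompt, answer):
            answer = revised
    return answer

def compute_optimal_answer(prompt: str, budget: int, threshold: float = 0.5) -> str:
    """Schematic compute-optimal allocation: spend a fixed sample budget on
    sequential revisions for easier prompts and on wider parallel search for
    harder ones; the threshold and routing rule are illustrative."""
    if estimate_difficulty(prompt) < threshold:
        return sequential_revisions(prompt, budget)
    return best_of_n(prompt, budget)

print(compute_optimal_answer("Prove the sum of two odd integers is even.", budget=8))
```

The design choice the abstract motivates is the per-prompt routing itself: a fixed budget spent uniformly, as in plain best-of-N, is the baseline that the adaptive allocation policy beats by more than 4x.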
