Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and for how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt so that it is used most effectively. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
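
For readers unfamiliar with the best-of-N baseline named in the abstract, the following is a minimal sketch of the general idea: draw N candidate responses and keep the one a verifier scores highest. It is not the paper's implementation. The names `best_of_n`, `sample`, `score`, `toy_sample`, and `toy_score` are hypothetical, and the toy sampler and scorer are deterministic stand-ins for an LLM and a learned verifier.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Draw n candidate responses and return the highest-scoring one.

    `sample(prompt, i)` is an assumed interface for drawing the i-th
    candidate; `score(prompt, response)` is an assumed verifier interface.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample(prompt, i) for i in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    # max() keeps the first candidate among ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical toy stand-ins so the sketch runs without a model.
def toy_sample(prompt: str, i: int) -> str:
    # Deterministic "answers" 0, 2, 4, 1, 3, ... for i = 0, 1, 2, 3, 4, ...
    return f"{prompt} -> answer {i * 7 % 5}"


def toy_score(prompt: str, response: str) -> float:
    # Pretend the verifier prefers larger trailing numbers.
    return float(response.rsplit(" ", 1)[-1])


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, best_of_n("q", toy_sample, toy_score, n))
```

In this sketch the test-time budget is simply `n`, applied uniformly to every prompt. A compute-optimal strategy in the abstract's sense would instead choose the method and budget per prompt, based on an estimate of that prompt's difficulty.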
