A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Abstract

We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates N candidate solutions, and then chooses the best one via a multiple-round knockout tournament in which each pair of candidates is compared K times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of N × (K + 1) highly parallelizable LLM calls are needed to solve an input problem. Assuming that a generated candidate solution is correct with probability p_{gen} > 0 and that a comparison between a correct and an incorrect solution identifies the right winner with probability p_{comp} > 0.5 (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to N and K: $P(\text{final output is incorrect}) \le (1 - p_{gen})^N + \lceil \log_2 N \rceil e^{-2 K (p_{comp} - 0.5)^2}.$ Our empirical results on the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
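The two stages and the bound can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the LLM is replaced by random draws (a candidate is correct with probability p_gen, and a mixed pair is judged correctly with probability p_comp), and the names `failure_bound`, `knockout`, and `simulate` are hypothetical. Ties and odd-sized rounds are handled in an arbitrary way (ties go to the first candidate, the unpaired candidate gets a bye), which the abstract does not specify.

```python
import math
import random


def failure_bound(n, k, p_gen, p_comp):
    """Upper bound on the failure probability, as stated in the abstract."""
    rounds = math.ceil(math.log2(n))
    return (1 - p_gen) ** n + rounds * math.exp(-2 * k * (p_comp - 0.5) ** 2)


def knockout(candidates, compare, k):
    """Reduce the candidates to one winner via pairwise matches of k comparisons.

    compare(a, b) must return True when a is judged better than b.
    """
    pool = list(candidates)
    while len(pool) > 1:
        next_pool = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(compare(a, b) for _ in range(k))
            # Assumption: a tie (possible for even k) goes to the first candidate.
            next_pool.append(a if 2 * wins_a >= k else b)
        if len(pool) % 2 == 1:
            # Assumption: an unpaired candidate advances without a match.
            next_pool.append(pool[-1])
        pool = next_pool
    return pool[0]


def simulate(n, k, p_gen, p_comp, trials, seed=0):
    """Estimate the failure rate when candidates are just True/False flags."""
    rng = random.Random(seed)

    def compare(a, b):
        if a == b:
            # Same correctness: the outcome does not matter, pick at random.
            return rng.random() < 0.5
        right_pick = rng.random() < p_comp
        return right_pick if a else not right_pick

    failures = 0
    for _ in range(trials):
        candidates = [rng.random() < p_gen for _ in range(n)]
        if not knockout(candidates, compare, k):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    n, k, p_gen, p_comp = 8, 10, 0.5, 0.9
    print("bound:    ", failure_bound(n, k, p_gen, p_comp))
    print("simulated:", simulate(n, k, p_gen, p_comp, trials=2000))
```

In this sketch the tournament plays at most N - 1 matches of K comparisons each, so together with the N generation calls the total stays within the N × (K + 1) budget mentioned above. The simulated failure rate should fall below the bound, since the bound is not expected to be tight.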
