When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Abstract

Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks such as mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method scores each solution with a reward model (verifier) and chooses the best one. Recent advances in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should the budget be spent on scaling solutions via SC, or should fewer solutions be generated so that compute can be allocated to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC only after consuming up to 8x the inference compute, and it requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
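The trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, averaging the verification scores is an assumed aggregation rule, and the cost model (each verification costs `lam` times as much as one solution) is an assumption made only for illustration.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def verified_best_of_n(answers, verification_scores):
    """Return the answer of the solution with the highest mean verifier score.

    verification_scores[i] holds the scores (assumed to lie in [0, 1]) from
    the verification chains-of-thought generated for solution i.
    Averaging them is an assumption, not a rule taken from the source.
    """
    means = [sum(scores) / len(scores) for scores in verification_scores]
    best_index = max(range(len(answers)), key=means.__getitem__)
    return answers[best_index]


def relative_cost(num_solutions, num_verifications, lam=1.0):
    """Assumed cost model: S solutions plus S * V verifications,
    where one verification costs `lam` times one solution."""
    return num_solutions * (1 + lam * num_verifications)


answers = ["42", "41", "42", "7"]
scores = [[1, 0], [1, 1], [0, 0], [0, 1]]

print(self_consistency(answers))            # majority answer
print(verified_best_of_n(answers, scores))  # answer with the best mean score
print(relative_cost(4, 2), relative_cost(12, 0))
```

Under these assumptions, 4 solutions with 2 verifications each cost as much as 12 solutions with no verification, which is the kind of budget-matched comparison the abstract refers to.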
