Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Abstract

Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS: its quality and computational cost simultaneously influence (1) reasoning performance and (2) compute efficiency. In this work, we challenge the conventional paradigms of verification and make the first attempt to systematically investigate the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over beam search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
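
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a granularity parameter g could interpolate between the two baselines. It assumes that g counts the generation steps between verifier calls, that g = 1 behaves like step-level beam search, and that g equal to the full solution length behaves like Best-of-N. The names vg_search, toy_extend, toy_score, beam_width, and branch are hypothetical and are not taken from the authors' code; the toy generator and verifier stand in for an LLM and a reward model.

```python
import random
from typing import Callable, List, Tuple

Path = Tuple[int, ...]  # a partial solution, one integer per generation step


def vg_search(
    extend: Callable[[Path, random.Random], int],
    score: Callable[[Path], float],
    total_steps: int,
    g: int,
    beam_width: int,
    branch: int,
    seed: int = 0,
) -> Tuple[Path, int]:
    """Illustrative variable-granularity search (assumed semantics).

    Every candidate is extended for up to g steps without verification,
    then all candidates are scored, the top beam_width are kept, and each
    survivor is duplicated branch times. Returns the best final path and
    the number of verifier calls that were made.
    """
    if g < 1 or total_steps < 1:
        raise ValueError("g and total_steps must be at least 1")
    rng = random.Random(seed)
    # Same number of parallel candidates for every g, so budgets are comparable.
    candidates: List[Path] = [() for _ in range(beam_width * branch)]
    verifier_calls = 0
    steps_done = 0
    while True:
        chunk = min(g, total_steps - steps_done)
        extended: List[Path] = []
        for path in candidates:
            for _ in range(chunk):  # generate chunk steps with no verification
                path = path + (extend(path, rng),)
            extended.append(path)
        steps_done += chunk
        # One verifier call per candidate at each verification point.
        ranked = sorted(extended, key=score, reverse=True)
        verifier_calls += len(extended)
        if steps_done == total_steps:
            return ranked[0], verifier_calls
        survivors = ranked[:beam_width]
        candidates = [p for p in survivors for _ in range(branch)]


def toy_extend(path: Path, rng: random.Random) -> int:
    """Hypothetical generator: appends a random digit."""
    return rng.randint(0, 9)


def toy_score(path: Path) -> float:
    """Hypothetical verifier: prefers paths with a larger digit sum."""
    return float(sum(path))


if __name__ == "__main__":
    for g in (1, 2, 4):
        best, calls = vg_search(toy_extend, toy_score, total_steps=4,
                                g=g, beam_width=2, branch=3)
        print(f"g={g} best={best} verifier_calls={calls}")
```

In this sketch the number of verifier calls falls as g grows, because the same candidates are scored at fewer checkpoints. That is the efficiency side of the trade-off the abstract describes. The accuracy side depends on the real generator and verifier, which the toy functions do not model.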
