The Art of Scaling Test-Time Compute for Large Language Models

Abstract

Test-time scaling (TTS), the dynamic allocation of compute during inference, is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is lacking, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated by eight open-source LLMs (7B to 235B parameters) across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget.
