What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Abstract

As enthusiasm for scaling computation (data and parameters) in the pretraining era has gradually diminished, test-time scaling (TTS), also referred to as "test-time computing," has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.
