AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

Abstract

Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, whereas many real-world problems are multi-stage complex tasks composed of a sequence of heterogeneous subtasks, each requiring an LLM with specific capabilities. We therefore study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, which aims to select a suitable model and allocate a budget for each subtask so as to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) the combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical; (ii) the optimal model and budget allocations across subtasks are interdependent, which increases the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets and derive three empirical insights that characterize the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative, feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and that it shows improved robustness to varying training set sizes and enhanced interpretability.
