Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
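
The sampling-gradient-update loop that the abstract refers to can be illustrated with a minimal sketch of plain REINFORCE over a probabilistic graph. This is not the Agent-REINFORCE method itself, which replaces the numeric gradient with LLM-generated textual feedback. The sketch assumes each edge is an independent Bernoulli variable with a learnable logit, and it omits the role and model assignments on nodes. The node names, `toy_reward`, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical three-node collaboration graph: two solver roles feed an aggregator.
EDGES = [
    ("solver_a", "aggregator"),
    ("solver_b", "aggregator"),
    ("solver_a", "solver_b"),
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_graph(logits, rng):
    """Draw one concrete graph: keep edge e with probability sigmoid(logits[e])."""
    return {e: int(rng.random() < sigmoid(t)) for e, t in logits.items()}


def toy_reward(graph):
    """Made-up utility standing in for task accuracy minus a latency cost."""
    return (
        graph[("solver_a", "aggregator")]
        + graph[("solver_b", "aggregator")]
        - graph[("solver_a", "solver_b")]
    )


def reinforce_step(logits, reward_fn, rng, n_samples=16, lr=0.5):
    """One sampling-gradient-update round of REINFORCE on the edge logits."""
    # Sampling: draw candidate graphs from the current distribution.
    samples = [sample_graph(logits, rng) for _ in range(n_samples)]
    rewards = [reward_fn(g) for g in samples]
    # Mean-reward baseline to reduce the variance of the estimate.
    baseline = sum(rewards) / n_samples
    updated = {}
    for e, t in logits.items():
        p = sigmoid(t)
        # Gradient: d/dt log Bernoulli(x; sigmoid(t)) = x - p.
        grad = sum(
            (r - baseline) * (g[e] - p) for g, r in zip(samples, rewards)
        ) / n_samples
        # Update: gradient ascent on the expected reward.
        updated[e] = t + lr * grad
    return updated


def search(n_steps=300, seed=0):
    """Run the loop from uniform edge probabilities and return the learned ones."""
    rng = random.Random(seed)
    logits = {e: 0.0 for e in EDGES}
    for _ in range(n_steps):
        logits = reinforce_step(logits, toy_reward, rng)
    return {e: sigmoid(t) for e, t in logits.items()}
```

Calling `search()` returns one probability per edge. Under this toy reward, the two solver-to-aggregator edges should drift toward 1 and the solver-to-solver edge toward 0. In a real setting the reward would come from evaluating each sampled graph on the task within the compute budget, which is what makes sample efficiency matter.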
