Scaling Test-time Compute for LLM Agents

Abstract

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which they improve agent effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies when applying test-time scaling to language agents, and report the following findings: 1. Scaling test-time compute can improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result-merging approaches, the list-wise method performs best. 4. Increasing the number of diversified rollouts has a positive effect on the agent's task performance.
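To make the combination of parallel sampling and list-wise selection concrete, here is a minimal sketch. It is not the authors' implementation: `toy_rollout`, `toy_listwise_verifier`, and `best_of_n` are hypothetical names, the agent is a random stub, and the verifier uses a toy agreement rule where a real system would presumably call an LLM judge. The only point illustrated is the interface: several rollouts are sampled, and a verifier that sees all candidates at once returns the one to keep.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_rollout(task: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one agent rollout; returns a candidate answer.

    Toy behavior (an assumption): the answer "42" comes back 40% of the
    time, otherwise a single random digit is returned.
    """
    return "42" if rng.random() < 0.4 else str(rng.randint(0, 9))


def toy_listwise_verifier(task: str, candidates: Sequence[str]) -> int:
    """Hypothetical list-wise verifier: receives ALL candidates together
    and returns the index of the one it selects.

    Toy rule (an assumption): pick the answer that the most candidates
    agree on, breaking ties in favor of the earliest candidate.
    """
    counts = {c: candidates.count(c) for c in candidates}
    best = max(candidates, key=lambda c: counts[c])  # first max wins ties
    return candidates.index(best)


def best_of_n(
    task: str,
    n: int,
    rollout: Callable[[str, random.Random], str],
    verifier: Callable[[str, Sequence[str]], int],
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Sample n rollouts, then let the list-wise verifier choose one."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    candidates = [rollout(task, rng) for _ in range(n)]
    chosen = verifier(task, candidates)
    return candidates[chosen], candidates


if __name__ == "__main__":
    answer, candidates = best_of_n(
        "toy task", 8, toy_rollout, toy_listwise_verifier, seed=1
    )
    print("candidates:", candidates)
    print("selected:", answer)
```

Increasing `n` in this sketch corresponds to spending more test-time compute. A score-based or pair-wise verifier would expose a different interface, scoring one candidate or comparing two at a time, rather than receiving the whole list.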
