Benchmark Test-Time Scaling of General LLM Agents

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
