CocoaBench: Evaluating Unified Digital Agents in the Wild

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, and recent agent scaffolds and models increasingly integrate these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for the more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving a success rate of only 45.1%. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
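
To make the task format concrete, the following is a minimal sketch of a task defined only by an instruction and an automatic checker over the final output, together with a success-rate calculation. It is not CocoaBench's actual schema or API: the names `Task`, `success_rate`, `TOY_TASKS`, and `toy_agent`, as well as the two toy tasks, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    # Hypothetical record, not CocoaBench's actual schema.
    instruction: str
    evaluate: Callable[[str], bool]  # inspects only the agent's final output


def success_rate(tasks: Sequence[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes that task's own checker."""
    if not tasks:
        return 0.0
    passed = sum(task.evaluate(agent(task.instruction)) for task in tasks)
    return passed / len(tasks)


# Toy tasks and a toy agent, invented for illustration only.
TOY_TASKS = [
    Task("Return the sum of 2 and 3.", lambda out: out.strip() == "5"),
    Task("Name the capital of France.", lambda out: out.strip().lower() == "paris"),
]


def toy_agent(instruction: str) -> str:
    # Stand-in for any scaffold: how the answer is produced is never inspected.
    return "5" if "sum" in instruction else "London"


if __name__ == "__main__":
    print(success_rate(TOY_TASKS, toy_agent))
```

Because the checker sees only the final output, any scaffold that maps an instruction to an answer string could be dropped in for `toy_agent` without changing the evaluation code, which is presumably what makes such a format portable across agent infrastructures.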
