CocoaBench: Evaluating Unified Digital Agents in the Wild
April 13, 2026
Authors: CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
cs.AI
Abstract
LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, and recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for more diverse use cases that require agents to combine them. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving a success rate of only 45.1%. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
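The abstract's task interface, an instruction plus an automatic evaluation function over the final output only, can be sketched as follows. This is a minimal illustration, not CocoaBench's actual API: all names, signatures, and the example task are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A task is fully defined by an instruction and an output checker."""
    instruction: str
    # The evaluator inspects only the agent's final output string,
    # never its intermediate steps, so any agent architecture can be scored.
    evaluate: Callable[[str], bool]

def run_and_score(agent: Callable[[str], str], task: Task) -> bool:
    # Run the agent end-to-end, then score only its final answer.
    final_output = agent(task.instruction)
    return task.evaluate(final_output)

# Hypothetical example task with an exact-match checker.
task = Task(
    instruction="Report the answer as a single number.",
    evaluate=lambda output: output.strip() == "42",
)

# A stub "agent" standing in for a real system; only its output matters.
print(run_and_score(lambda instruction: "42", task))    # True
print(run_and_score(lambda instruction: "forty-two", task))  # False
```

Because scoring touches only the final output, the same task definition works unchanged across different scaffolds and model backbones.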