

CocoaBench: Evaluating Unified Digital Agents in the Wild

April 13, 2026
作者: CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
cs.AI

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only a 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
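The task format described in the abstract, an instruction paired with an automatic evaluation function applied only to the final output, can be illustrated with a minimal sketch. All names here (`CocoaTask`, `run_benchmark`, the example instruction and checker) are hypothetical illustrations under that stated design, not CocoaBench's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the task format described in the abstract:
# each task is a natural-language instruction plus an automatic
# evaluation function over the agent's final output. Names are
# illustrative, not CocoaBench's published interface.

@dataclass
class CocoaTask:
    instruction: str                 # what the agent is asked to do
    evaluate: Callable[[str], bool]  # checks only the final output

# Example task: the agent may combine vision, search, or coding
# internally; evaluation never inspects intermediate steps, so any
# agent infrastructure that yields a final string can be scored.
task = CocoaTask(
    instruction="Report the release year of the first public Linux "
                "kernel as a four-digit number.",
    evaluate=lambda output: output.strip() == "1991",
)

def run_benchmark(agent: Callable[[str], str],
                  tasks: list[CocoaTask]) -> float:
    """Return the fraction of tasks whose final output passes evaluation."""
    successes = sum(t.evaluate(agent(t.instruction)) for t in tasks)
    return successes / len(tasks)
```

Because the evaluation function sees only the final output, the same task definition can score agents with very different internal architectures, which is what makes this specification style scalable across scaffolds.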