CocoaBench: Evaluating Unified Digital Agents in the Wild

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, and recent agent scaffolds and models increasingly integrate these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, leaving a gap for the more diverse use cases that require agents to combine them. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only a 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
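To make the task format concrete, the following is a minimal sketch of what "an instruction plus an automatic evaluation function over the final output" could look like. It is not CocoaBench's actual schema or API: the names `Task`, `success_rate`, and `toy_agent`, and the two toy tasks, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus a checker over the final output."""
    instruction: str
    evaluate: Callable[[str], bool]


def success_rate(tasks: list[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes that task's evaluation function."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if task.evaluate(agent(task.instruction)))
    return passed / len(tasks)


# Toy tasks, invented for illustration only (not taken from the benchmark).
toy_tasks = [
    Task("Return the number of days in a leap year.",
         lambda out: out.strip() == "366"),
    Task("Return the capital of France.",
         lambda out: out.strip().lower() == "paris"),
]


def toy_agent(instruction: str) -> str:
    """Stand-in agent; a real one would combine vision, search, and coding tools."""
    return "366" if "leap year" in instruction else "unknown"


rate = success_rate(toy_tasks, toy_agent)
```

Because the checker in this sketch sees only the final output, the agent is treated as a black box mapping an instruction to an answer, which is one way the same tasks could be scored across different agent infrastructures.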
