WeaveBench: A Long-Horizon Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Abstract

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. We therefore introduce WeaveBench, a long-horizon hybrid-interface benchmark of 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires the agent to combine GUI observations and actions with CLI and code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, and that detects shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing that the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed for measuring whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.
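To illustrate why outcome-only grading can overestimate performance, the following minimal Python sketch contrasts the two grading modes on made-up data. It is not the WeaveBench judge: the `TaskResult` record, the `pass_rate` helper, the shortcut labels, and the four example results are all hypothetical, and the sketch simply assumes that a trajectory-aware grader rejects any task whose trace carries a shortcut flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    outcome_ok: bool             # final deliverable passed its checks
    shortcut_flags: tuple = ()   # assumed labels, e.g. ("hard_coded_metric",)


def pass_rate(results, trajectory_aware):
    """Return the percentage of tasks counted as passed.

    Outcome-only grading looks at outcome_ok alone. In this sketch,
    trajectory-aware grading additionally fails any task with a shortcut flag.
    """
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.outcome_ok and not (trajectory_aware and r.shortcut_flags)
    )
    return 100.0 * passed / len(results)


# Made-up results for four tasks (not benchmark data).
results = [
    TaskResult("t1", True),
    TaskResult("t2", True, ("fabricated_screenshot",)),
    TaskResult("t3", False),
    TaskResult("t4", True, ("hard_coded_metric",)),
]

print(pass_rate(results, trajectory_aware=False))  # 75.0
print(pass_rate(results, trajectory_aware=True))   # 25.0
```

In this toy example, two of the three apparently successful tasks reached their result through a shortcut, so the pass rate drops from 75.0 to 25.0 once the trace is taken into account.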