Multi-Agent Computer Use

Abstract

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes a computer use task into a directed acyclic graph (DAG) that encodes the relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out the nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and the DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4-25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by approximately 1.5×, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
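
The manager loop summarized above (compute the ready frontier, dispatch subagents in parallel, record their findings, revise the DAG) can be illustrated with a minimal sketch. This is not the released implementation: `Node`, `ready_frontier`, `run_manager`, the toy subagent, and the toy revision rule are all hypothetical names introduced here, and a thread pool stands in for real parallel CUA subagents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One unit of work in the task DAG (hypothetical structure)."""
    goal: str
    deps: List[str] = field(default_factory=list)
    status: str = "pending"          # "pending", "done", or "canceled"
    finding: Optional[str] = None    # what the subagent reported back


def ready_frontier(dag: Dict[str, Node]) -> List[str]:
    """Pending nodes whose dependencies have all completed."""
    return sorted(
        name for name, node in dag.items()
        if node.status == "pending"
        and all(dag[d].status == "done" for d in node.deps)
    )


def run_manager(
    dag: Dict[str, Node],
    subagent: Callable[[str, Dict[str, Optional[str]]], str],
    revise: Callable[[Dict[str, Node], List[str]], None],
    max_workers: int = 4,
    max_iterations: int = 50,
) -> List[List[str]]:
    """Run frontier nodes in parallel until none remain; return the waves."""
    waves: List[List[str]] = []

    def work(name: str) -> str:
        # Findings of upstream nodes are forwarded explicitly, since the
        # subagent may be unable to re-observe them in the environment.
        context = {d: dag[d].finding for d in dag[name].deps}
        return subagent(dag[name].goal, context)

    for _ in range(max_iterations):
        frontier = ready_frontier(dag)
        if not frontier:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            findings = list(pool.map(work, frontier))
        for name, finding in zip(frontier, findings):
            dag[name].status = "done"
            dag[name].finding = finding
        revise(dag, frontier)  # manager may add, cancel, or rewrite nodes
        waves.append(frontier)
    return waves


def toy_subagent(goal: str, context: Dict[str, Optional[str]]) -> str:
    # Stand-in for a real CUA rollout: just echoes what it was given.
    return f"{goal} | saw {sorted(context)}"


def toy_revise(dag: Dict[str, Node], just_finished: List[str]) -> None:
    # Toy rule: once flights are found, add a booking step after "compare".
    if "search_flights" in just_finished and "book" not in dag:
        dag["book"] = Node("book cheapest option", deps=["compare"])


dag = {
    "search_flights": Node("find flights"),
    "search_hotels": Node("find hotels"),
    "compare": Node("compare totals", deps=["search_flights", "search_hotels"]),
}
waves = run_manager(dag, toy_subagent, toy_revise)
print(waves)
```

In this toy run the two search nodes execute together in the first wave, `compare` runs once both have finished, and the `book` node added during revision runs last. A node whose dependency is canceled never enters the frontier, which is one simple way to prune downstream work; a real manager would presumably handle cancellation and rewriting with more care.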