Multi-Agent Computer Use

Abstract

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and continual re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes a computer use task into a directed acyclic graph (DAG) that encodes the dependencies and goals relevant to each subagent. At each iteration, the manager dispatches parallel CUA subagents to carry out the nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partial observability of computer use environments as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and the DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4% to 25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU reduces average wall-clock task completion time by a factor of approximately 1.5, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight multi-agent coordination as a promising axis for scaling computer use agents to work longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
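The manager loop described above (dispatch the ready frontier in parallel, collect findings, revise the DAG, repeat) can be illustrated with a minimal sketch. This is not the released implementation: the names Node, ready_frontier, run_dag, fake_subagent, and fake_revise are hypothetical, a thread pool stands in for real CUA subagents, and the revision step is a hand-written rule rather than a manager model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One subtask in the plan (hypothetical structure, not from the paper)."""
    goal: str
    deps: List[str] = field(default_factory=list)
    status: str = "pending"  # pending | done | canceled
    finding: Optional[str] = None


def ready_frontier(dag: Dict[str, Node]) -> List[str]:
    """Names of pending nodes whose dependencies are all done."""
    return sorted(
        name for name, node in dag.items()
        if node.status == "pending"
        and all(dag[d].status == "done" for d in node.deps)
    )


def run_dag(
    dag: Dict[str, Node],
    run_subagent: Callable[[str, Dict[str, Optional[str]]], str],
    revise: Callable[[Dict[str, Node], Dict[str, str]], None],
    max_workers: int = 4,
) -> List[List[str]]:
    """Run frontiers in parallel until nothing is runnable.

    Returns the frontier dispatched at each iteration.
    """
    history: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            frontier = ready_frontier(dag)
            if not frontier:
                break
            history.append(frontier)
            # Each subagent receives its goal plus the findings of its
            # dependencies, so it need not re-observe upstream information.
            goals = [dag[n].goal for n in frontier]
            contexts = [{d: dag[d].finding for d in dag[n].deps} for n in frontier]
            findings = list(pool.map(run_subagent, goals, contexts))
            for name, finding in zip(frontier, findings):
                dag[name].status = "done"
                dag[name].finding = finding
            # The manager may add, cancel, or rewrite nodes here.
            revise(dag, dict(zip(frontier, findings)))
    return history


def fake_subagent(goal: str, context: Dict[str, Optional[str]]) -> str:
    """Stand-in for a CUA subagent; a real one would drive a browser or desktop."""
    return f"result of {goal!r} given {sorted(context)}"


def fake_revise(dag: Dict[str, Node], new_findings: Dict[str, str]) -> None:
    """Stand-in for the manager: adds a node once flight results arrive."""
    if "find_flights" in new_findings and "book" not in dag:
        dag["book"] = Node("book the cheapest option",
                           deps=["find_flights", "find_hotels"])


plan = {
    "find_flights": Node("search flights"),
    "find_hotels": Node("search hotels"),
    "summarize": Node("write summary", deps=["find_flights", "find_hotels"]),
}
history = run_dag(plan, fake_subagent, fake_revise)
print(history)
```

In this toy plan the two search nodes run together in the first iteration, and the node added during revision joins the summary node on the second frontier. A node whose dependency is canceled never becomes ready in this sketch; a real manager would presumably rewrite or cancel such nodes explicitly.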