Multi-Agent Computer Use

Abstract

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and continual re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes a computer use task into a directed acyclic graph (DAG) that encodes the relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out the nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and the DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4-25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU speeds up average wall-clock task completion by roughly 1.5×, demonstrating its efficacy in accelerating traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
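
To make the manager loop described above concrete, the following is a minimal sketch and not the released implementation. It assumes a DAG stored as a dictionary of nodes, a stand-in `run_subagent` coroutine in place of a real CUA subagent, and an optional `revise` hook where a manager model could add, cancel, or rewrite nodes. All names here (`Node`, `ready_frontier`, `run_subagent`, `manage`, `revise`) are hypothetical and chosen for illustration only.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One subtask in the DAG (hypothetical structure for illustration)."""
    goal: str
    deps: set = field(default_factory=set)   # names of prerequisite nodes
    status: str = "pending"                  # "pending", "done", or "canceled"
    finding: Optional[str] = None            # what the subagent reported back


def ready_frontier(dag: dict) -> list:
    """Return pending nodes whose dependencies have all completed."""
    return sorted(
        name for name, node in dag.items()
        if node.status == "pending"
        and all(dag[d].status == "done" for d in node.deps)
    )


async def run_subagent(name: str, node: Node, context: dict) -> str:
    """Stand-in for a real CUA subagent (assumption: it returns a text finding)."""
    await asyncio.sleep(0)  # yield control, as a real agent call would
    return f"{node.goal} (used {len(context)} upstream findings)"


async def manage(dag: dict, revise: Optional[Callable] = None) -> list:
    """Dispatch the ready frontier in parallel until no node is ready."""
    waves = []
    while True:
        frontier = ready_frontier(dag)
        if not frontier:
            break
        waves.append(frontier)
        # Pass upstream findings forward, since a subagent may not be able
        # to re-observe what an earlier subagent saw.
        results = await asyncio.gather(*(
            run_subagent(n, dag[n], {d: dag[d].finding for d in dag[n].deps})
            for n in frontier
        ))
        for name, finding in zip(frontier, results):
            dag[name].status = "done"
            dag[name].finding = finding
        if revise is not None:
            # Hook where a manager could add, cancel, or rewrite nodes.
            revise(dag, dict(zip(frontier, results)))
    return waves


if __name__ == "__main__":
    example = {
        "find_flight": Node("find a flight"),
        "find_hotel": Node("find a hotel"),
        "build_itinerary": Node("build itinerary", {"find_flight", "find_hotel"}),
    }
    print(asyncio.run(manage(example)))
```

In this sketch, independent nodes run together in one wave, and a dependent node waits until every prerequisite has reported a finding. Nodes that depend on a canceled node never become ready, so the loop simply terminates; a real manager would presumably rewrite or replace such nodes instead.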