Multi-Agent Computer Use

Abstract

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes a computer use task into a directed acyclic graph (DAG) that encodes the relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out the nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and the DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4-25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by approximately 1.5x, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
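
To make the dispatch loop described above concrete, the following is a minimal sketch, not the released implementation. It assumes a DAG stored as a dictionary of nodes, a stubbed subagent, and an optional revision hook. The names `Node`, `ready_frontier`, `run_subagent`, `run_manager`, and `revise`, as well as the travel-booking example task, are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    goal: str                                    # natural-language goal for the subagent
    deps: List[str] = field(default_factory=list)
    status: str = "pending"                      # "pending", "done", or "canceled"
    finding: Optional[str] = None                # what the subagent reported back


def ready_frontier(dag: Dict[str, Node]) -> List[str]:
    """Return pending nodes whose dependencies have all completed."""
    return [
        name for name, node in dag.items()
        if node.status == "pending"
        and all(dag[d].status == "done" for d in node.deps)
    ]


def run_subagent(name: str, node: Node, context: Dict[str, Optional[str]]) -> str:
    # Hypothetical stand-in for a CUA subagent. A real subagent would act in
    # a browser or desktop; here it only reports what it was given.
    return f"{name} finished with {len(context)} upstream finding(s)"


def run_manager(
    dag: Dict[str, Node],
    revise: Optional[Callable[[Dict[str, Node]], None]] = None,
    max_iters: int = 10,
) -> List[List[str]]:
    """Dispatch the ready frontier in parallel until no node is ready."""
    waves = []
    for _ in range(max_iters):
        frontier = ready_frontier(dag)
        if not frontier:
            break
        waves.append(frontier)
        with ThreadPoolExecutor() as pool:
            futures = {
                name: pool.submit(
                    run_subagent,
                    name,
                    dag[name],
                    # Upstream findings are passed forward explicitly, since a
                    # downstream subagent may be unable to re-observe them.
                    {d: dag[d].finding for d in dag[name].deps},
                )
                for name in frontier
            }
            for name, future in futures.items():
                dag[name].finding = future.result()
                dag[name].status = "done"
        if revise is not None:
            revise(dag)  # assumed hook: add, cancel, or rewrite nodes in place
    return waves


# Hypothetical example task: two independent searches, then a dependent booking.
dag = {
    "search_flights": Node("find flight options"),
    "search_hotels": Node("find hotel options"),
    "book_trip": Node(
        "book the best combination",
        deps=["search_flights", "search_hotels"],
    ),
}
waves = run_manager(dag)
```

In this sketch the two searches form the first frontier and run concurrently, and the booking node becomes ready only once both have reported. A node that depends on a canceled node never becomes ready here, so a manager following this design would need to rewrite or cancel such nodes during revision.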