CoAct-1: Computer-using Agents with Coding as Actions

Abstract

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
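
The delegation pattern described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Subtask`, `needs_visual`, `gui_operator`, `programmer`, and `orchestrate` are hypothetical, and the hard-coded boolean flag stands in for what the abstract describes as a dynamic delegation decision made by the Orchestrator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    description: str
    # Hypothetical flag: True when the step needs on-screen interaction.
    # The abstract describes a dynamic decision; a fixed flag is an assumption.
    needs_visual: bool


def gui_operator(task: Subtask) -> str:
    # Stand-in for an agent that would click and type through the GUI.
    return f"[gui] {task.description}"


def programmer(task: Subtask) -> str:
    # Stand-in for an agent that would write and run a Python or Bash script.
    return f"[code] {task.description}"


def orchestrate(subtasks: List[Subtask]) -> List[str]:
    """Route each subtask to one of the two executors and collect their reports."""
    results = []
    for task in subtasks:
        executor = gui_operator if task.needs_visual else programmer
        results.append(executor(task))
    return results


if __name__ == "__main__":
    plan = [
        Subtask("rename all .txt files in a folder", needs_visual=False),
        Subtask("click the Export button in the open dialog", needs_visual=True),
    ]
    for line in orchestrate(plan):
        print(line)
```

In this sketch, file-management style steps go to the programmer stand-in and visually grounded steps go to the GUI stand-in, mirroring the split the abstract motivates.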
