
CoAct-1: Computer-using Agents with Coding as Actions

August 5, 2025
Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
cs.AI

Abstract

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
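The delegation idea in the abstract can be illustrated with a small sketch. This is not the CoAct-1 implementation: the abstract does not say how the Orchestrator decides, so the keyword heuristic, the `Subtask` type, and the function names `choose_agent`, `gui_operator`, `programmer`, and `orchestrate` are all hypothetical stand-ins. The two agents here only return labeled strings; a real system would drive a screen or execute Python or Bash scripts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical routing hints. The abstract names file management and data
# processing as tasks suited to code, but gives no actual decision rule.
PROGRAMMATIC_HINTS = ("file", "rename", "csv", "data", "batch")


@dataclass
class Subtask:
    description: str


def gui_operator(subtask: Subtask) -> str:
    """Stand-in for a GUI agent that would click and type on screen."""
    return f"[gui] {subtask.description}"


def programmer(subtask: Subtask) -> str:
    """Stand-in for an agent that would write and run a Python or Bash script."""
    return f"[code] {subtask.description}"


AGENTS: Dict[str, Callable[[Subtask], str]] = {
    "gui_operator": gui_operator,
    "programmer": programmer,
}


def choose_agent(subtask: Subtask) -> str:
    """Pick an agent name using the assumed keyword heuristic."""
    text = subtask.description.lower()
    if any(hint in text for hint in PROGRAMMATIC_HINTS):
        return "programmer"
    return "gui_operator"


def orchestrate(subtasks: List[Subtask]) -> List[str]:
    """Route each subtask to the chosen agent and collect the results in order."""
    return [AGENTS[choose_agent(s)](s) for s in subtasks]


if __name__ == "__main__":
    plan = [
        Subtask("Rename 200 files in ~/Downloads"),
        Subtask("Click the Save button in the dialog"),
    ]
    for line in orchestrate(plan):
        print(line)
```

In this sketch the bulk rename goes to the programmer stand-in as a single step, while the dialog interaction stays with the GUI operator, which mirrors the abstract's point that code can replace long GUI action sequences without giving up visual interaction.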
August 8, 2025