CoAct-1: Computer-using Agents with Coding as Actions
August 5, 2025
Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
cs.AI
Abstract
Autonomous agents that operate computers via Graphical User Interfaces (GUIs)
often struggle with efficiency and reliability on complex, long-horizon tasks.
While augmenting these agents with planners can improve task decomposition,
they remain constrained by the inherent limitations of performing all actions
through GUI manipulation, leading to brittleness and inefficiency. In this
work, we introduce a more robust and flexible paradigm: enabling agents to use
coding as an enhanced action. We present CoAct-1, a novel multi-agent system
that synergistically combines GUI-based control with direct programmatic
execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks
to either a conventional GUI Operator or a specialized Programmer agent, which
can write and execute Python or Bash scripts. This hybrid approach allows the
agent to bypass inefficient GUI action sequences for tasks like file management
and data processing, while still leveraging visual interaction when necessary.
We evaluate our system on the challenging OSWorld benchmark, where CoAct-1
achieves a new state-of-the-art success rate of 60.76%, significantly
outperforming prior methods. Furthermore, our approach dramatically improves
efficiency, reducing the average number of steps required to complete a task to
just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that
integrating coding as a core action provides a more powerful, efficient, and
scalable path toward generalized computer automation.
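
The delegation pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names mirror the roles named above, but the `Subtask` type, the `delegate` method, and the routing rule (send a subtask to the Programmer whenever a script is attached) are assumptions made for illustration. The abstract only says the Orchestrator delegates dynamically and does not specify how. The GUI Operator here is a stub that records the step instead of driving a real interface.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subtask:
    """Hypothetical subtask record; not taken from the paper."""
    description: str
    script: Optional[str] = None  # Python source, if the step is scriptable


class GUIOperator:
    """Stub: a real operator would act on screenshots with clicks and keystrokes."""

    def run(self, subtask: Subtask) -> str:
        return f"gui:{subtask.description}"


class Programmer:
    """Runs the attached Python script in a child interpreter and returns its stdout."""

    def run(self, subtask: Subtask) -> str:
        result = subprocess.run(
            [sys.executable, "-c", subtask.script],
            capture_output=True,
            text=True,
            check=True,  # raise if the script exits with a non-zero status
        )
        return f"code:{result.stdout.strip()}"


class Orchestrator:
    """Routes each subtask to one of the two agents."""

    def __init__(self) -> None:
        self.gui_operator = GUIOperator()
        self.programmer = Programmer()

    def delegate(self, subtask: Subtask) -> str:
        # Assumed routing rule: prefer code when a script is available,
        # otherwise fall back to GUI interaction.
        if subtask.script is not None:
            return self.programmer.run(subtask)
        return self.gui_operator.run(subtask)


if __name__ == "__main__":
    orchestrator = Orchestrator()
    print(orchestrator.delegate(Subtask("click the Save button")))
    print(orchestrator.delegate(Subtask("sum a column", script="print(sum([1, 2, 3]))")))
```

In this sketch a data-processing step costs one scripted action rather than a sequence of GUI actions, which is the kind of step reduction the abstract reports.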