CoAct-1: Computer-using Agents with Coding as Actions

Abstract

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
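The delegation pattern described in the abstract can be illustrated with a minimal sketch. This is not the CoAct-1 implementation: the abstract does not say how the Orchestrator chooses an agent, so the keyword heuristic in `route`, the `CODE_HINTS` list, the `Subtask` class, and the placeholder agent functions are all hypothetical names introduced only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical keyword list: a stand-in for the Orchestrator's real decision
# process, which the abstract does not specify.
CODE_HINTS = ("file", "rename", "csv", "data", "convert")


@dataclass
class Subtask:
    description: str


def gui_operator(subtask: Subtask) -> str:
    # Placeholder: a real GUI Operator would issue clicks and keystrokes.
    return f"GUI actions for: {subtask.description}"


def programmer(subtask: Subtask) -> str:
    # Placeholder: a real Programmer would write and run a Python or Bash script.
    return f"script for: {subtask.description}"


def route(subtask: Subtask) -> str:
    """Pick an agent name for a subtask using the keyword heuristic."""
    text = subtask.description.lower()
    return "programmer" if any(hint in text for hint in CODE_HINTS) else "gui_operator"


class Orchestrator:
    def __init__(self, agents: Dict[str, Callable[[Subtask], str]]) -> None:
        self.agents = agents

    def run(self, subtasks: List[Subtask]) -> List[Tuple[str, str]]:
        """Delegate each subtask and record which agent handled it."""
        results = []
        for subtask in subtasks:
            name = route(subtask)
            results.append((name, self.agents[name](subtask)))
        return results


orchestrator = Orchestrator({"gui_operator": gui_operator, "programmer": programmer})
plan = [
    Subtask("Rename every report in the Downloads folder"),
    Subtask("Click the Save button in the open dialog"),
]
results = orchestrator.run(plan)
```

In this sketch the first subtask is routed to the programmer and the second to the GUI operator. A real system would presumably replace `route` with a model-driven decision and the placeholder functions with agents that actually act on the machine.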
