Agent S: An Open Agentic Framework that Uses Computers Like a Human

Abstract

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S addresses three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37 percentage points in success rate (an 83.6% relative improvement) and sets a new state of the art. Comprehensive analysis highlights the effectiveness of the individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on the newly released WindowsAgentArena benchmark. Code is available at https://github.com/simular-ai/Agent-S.
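The two improvement figures in the abstract are related by simple arithmetic: an absolute gain of 9.37 and a relative gain of 83.6% together imply the underlying success rates. The sketch below works this out. The function name `implied_rates` is hypothetical, and the derived rates are back-calculated approximations from the two numbers quoted above, not figures stated in the abstract.

```python
def implied_rates(abs_gain_pts: float, rel_gain: float) -> tuple[float, float]:
    """Back out baseline and improved success rates (in percent).

    abs_gain_pts: absolute gain in percentage points (improved - baseline)
    rel_gain:     relative gain as a fraction ((improved - baseline) / baseline)
    """
    # rel_gain = abs_gain_pts / baseline, so baseline = abs_gain_pts / rel_gain
    baseline = abs_gain_pts / rel_gain
    improved = baseline + abs_gain_pts
    return baseline, improved


# Figures quoted in the abstract: 9.37 absolute gain, 83.6% relative gain.
baseline, improved = implied_rates(9.37, 0.836)
print(f"implied baseline success rate: {baseline:.1f}%")
print(f"implied Agent S success rate:  {improved:.1f}%")
```

Read this way, the 9.37 figure is an absolute difference in success rate, while 83.6% expresses that same difference relative to the baseline.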
