Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Abstract

Computer use agents automate digital tasks by interacting directly with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by handling an open-ended space of user queries. However, current agents face several significant challenges: imprecise grounding of GUI elements, difficulty with long-horizon task planning, and performance bottlenecks caused by relying on a single generalist model for diverse cognitive tasks. To address these challenges, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization, and we introduce Proactive Hierarchical Planning, which dynamically refines action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves relative improvements of 18.9% and 32.7% over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations, respectively. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing the previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld in relative terms. Code is available at https://github.com/simular-ai/Agent-S.
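The improvements quoted in the abstract are relative, not absolute, so it may help to see how the two differ. The sketch below is illustrative only: the function name `relative_improvement` and the success rates are hypothetical and are not taken from the paper's results.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical success rates (in percent), chosen only to show the arithmetic.
baseline_rate = 20.0
new_rate = 25.0

# Absolute gain is a difference in percentage points.
absolute_gain = new_rate - baseline_rate
# Relative gain scales that difference by the baseline.
relative_gain = relative_improvement(baseline_rate, new_rate)

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

Under these made-up numbers, a 5-point absolute gain corresponds to a 25% relative gain, which is why relative figures can look large when the baseline success rate is low.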
