Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Abstract

Computer use agents automate digital tasks by interacting directly with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by handling an open-ended space of user queries. However, current agents face three major challenges: imprecise grounding of GUI elements, difficulty with long-horizon task planning, and performance bottlenecks caused by relying on a single generalist model for diverse cognitive tasks. To address these, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization, and we introduce Proactive Hierarchical Planning, which dynamically refines action plans at multiple temporal scales in response to evolving observations. Evaluations show that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves relative improvements of 18.9% and 32.7% over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations, respectively. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing the previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld in relative terms. Code is available at https://github.com/simular-ai/Agent-S.
