Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
September 12, 2024
Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
cs.AI
Abstract
Large language models (LLMs) show remarkable potential to act as computer
agents, enhancing human productivity and software accessibility in multi-modal
tasks that require planning and reasoning. However, measuring agent performance
in realistic environments remains a challenge since: (i) most benchmarks are
limited to specific modalities or domains (e.g. text-only, web navigation, Q&A,
coding) and (ii) full benchmark evaluations are slow (on the order of days)
given the multi-step sequential nature of tasks. To address these
challenges, we introduce the Windows Agent Arena: a reproducible, general
environment focusing exclusively on the Windows operating system (OS) where
agents can operate freely within a real Windows OS and use the same wide range
of applications, tools, and web browsers available to human users when solving
tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse
Windows tasks across representative domains that require agent abilities in
planning, screen understanding, and tool usage. Our benchmark is scalable and
can be seamlessly parallelized in Azure for a full benchmark evaluation in as
little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we
also introduce a new multi-modal agent, Navi. Our agent achieves a success rate
of 19.5% in the Windows domain, compared to 74.5% for an unassisted
human. Navi also demonstrates strong performance on another popular web-based
benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis
of Navi's performance, and provide insights into the opportunities for future
research in agent development and data generation using Windows Agent Arena.
Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena
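
To make the setup concrete, below is a minimal sketch of the perception-action loop an agent like Navi runs inside the Windows Agent Arena environment: capture a screenshot of the live Windows desktop, ask a multi-modal LLM for the next UI action, execute it, and repeat until the task is judged complete. The names here (env.capture_screenshot, llm_client.chat, the action grammar) are illustrative assumptions, not the benchmark's actual API.

```python
import base64

def run_episode(env, llm_client, instruction, max_steps=20):
    """Drive one benchmark task: observe the screen, ask a multi-modal
    LLM for an action, execute it, and stop on DONE or step budget.

    `env` and `llm_client` are hypothetical stand-ins for the arena
    environment and a vision-capable LLM endpoint."""
    for _ in range(max_steps):
        screenshot = env.capture_screenshot()             # raw PNG bytes
        image_b64 = base64.b64encode(screenshot).decode()
        reply = llm_client.chat(
            text=(f"Task: {instruction}\n"
                  "Reply with exactly one action, e.g. "
                  "click(x, y), type('text'), or DONE."),
            image_b64=image_b64,
        )
        action = reply.strip()
        if action == "DONE":
            break
        env.execute(action)        # injected mouse/keyboard event in the OS
    return env.evaluate()          # task-specific programmatic success check
```

The key design point the abstract highlights is that env.evaluate() is a programmatic, task-specific check run against the real OS state (files, settings, app contents), which is what makes large-scale unattended evaluation reproducible.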
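The "full benchmark in as little as 20 minutes" claim follows from parallelism rather than faster agents: with 150+ tasks sharded across many isolated Windows VMs, wall-clock time approaches that of the slowest shard. Below is a hedged sketch of that sharding under stated assumptions; run_task_on_fresh_vm is a stub standing in for provisioning an Azure VM, running the agent, and tearing down, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_on_fresh_vm(task):
    """Placeholder worker: the real benchmark would provision an isolated
    Windows VM (e.g. on Azure), run the agent on `task`, and tear down."""
    return False  # stub result: pretend the task failed

def parallel_benchmark(tasks, num_workers=40):
    """Shard tasks round-robin across workers and evaluate shards concurrently."""
    shards = [tasks[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_task = [ok
                    for shard_results in pool.map(
                        lambda shard: [run_task_on_fresh_vm(t) for t in shard],
                        shards)
                    for ok in shard_results]
    return sum(per_task) / len(per_task)  # overall success rate

# Example: 150 placeholder tasks, evaluated across 40 concurrent shards.
print(parallel_benchmark([f"task_{i}" for i in range(150)]))
```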