Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

September 12, 2024
作者: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
cs.AI

Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step, sequential nature of the tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure, allowing a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to an unassisted human success rate of 74.5%. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena
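
To make the multi-step, sequential nature of these tasks concrete, the sketch below shows a minimal observe-plan-act loop of the kind such OS-agent benchmarks exercise: the agent receives a screenshot of the screen, asks a multimodal model for the next UI action, executes it, and repeats until the task is judged complete or a step limit is reached. This is an illustrative sketch only; the environment methods (take_screenshot, execute, evaluate) and the model client are hypothetical placeholders, not the actual Windows Agent Arena or Navi APIs.

    import base64

    # Hypothetical sketch of a screenshot-driven OS agent; none of these
    # classes or method names are the real Windows Agent Arena API.
    class LLMAgent:
        def __init__(self, model_client, system_prompt):
            self.model = model_client           # any multimodal chat client
            self.system_prompt = system_prompt  # instructions + allowed action schema

        def propose_action(self, screenshot_png, task, history):
            # A multimodal prompt: the task description, prior actions, and the
            # current screenshot (base64-encoded, as many vision APIs expect).
            messages = [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": f"Task: {task}\nPrevious actions: {history}"},
                    {"type": "image", "data": base64.b64encode(screenshot_png).decode()},
                ]},
            ]
            # The model is expected to return a single action string,
            # e.g. "click(120, 340)" or "type('hello')".
            return self.model.complete(messages)

    def run_episode(env, agent, task, max_steps=15):
        """Observe-plan-act loop: see the screen, pick one UI action,
        execute it, and stop when done or when the step budget runs out."""
        history = []
        for _ in range(max_steps):
            screenshot = env.take_screenshot()  # raw pixels of the VM screen
            action = agent.propose_action(screenshot, task, history)
            done = env.execute(action)          # mouse/keyboard action in the OS
            history.append(action)
            if done:
                break
        return env.evaluate(task)               # task-specific success check

The step budget (max_steps) is what makes full evaluations expensive when run serially, and why parallelizing episodes across many VMs, as the benchmark does in Azure, shortens a full run from days to minutes.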
