Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge for two reasons: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding), and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure, completing a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% for an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena
