OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
April 11, 2024
Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
cs.AI
Abstract
Autonomous agents that accomplish complex computer tasks with minimal human
interventions have the potential to transform human-computer interaction,
significantly enhancing accessibility and productivity. However, existing
benchmarks either lack an interactive environment or are limited to
environments specific to certain applications or domains, failing to reflect
the diverse and complex nature of real-world computer use, thereby limiting the
scope of tasks and agent scalability. To address this issue, we introduce
OSWorld, the first-of-its-kind scalable, real computer environment for
multimodal agents, supporting task setup, execution-based evaluation, and
interactive learning across various operating systems such as Ubuntu, Windows,
and macOS. OSWorld can serve as a unified, integrated computer environment for
assessing open-ended computer tasks that involve arbitrary applications.
Building upon OSWorld, we create a benchmark of 369 computer tasks involving
real web and desktop apps in open domains, OS file I/O, and workflows spanning
multiple applications. Each task example is derived from real-world computer
use cases and includes a detailed initial state setup configuration and a
custom execution-based evaluation script for reliable, reproducible evaluation.
Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld
reveals significant deficiencies in their ability to serve as computer
assistants. While humans can accomplish over 72.36% of the tasks, the best
model achieves only 12.24% success, primarily struggling with GUI grounding and
operational knowledge. Comprehensive analysis using OSWorld provides valuable
insights for developing multimodal generalist agents that were not possible
with previous benchmarks. Our code, environment, baseline models, and data are
publicly available at https://os-world.github.io.
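The abstract notes that each task pairs an initial state setup with a custom execution-based evaluation script, i.e., success is judged by the resulting environment state rather than by comparing the agent's actions to a gold trajectory. The following is a minimal, hypothetical sketch of that idea using a toy in-memory file system; the field names and `evaluate` function are illustrative assumptions, not the actual OSWorld task schema.

```python
# Hypothetical sketch of execution-based evaluation, in the spirit of the
# setup-config + evaluation-script pairing described in the abstract.
# All names and the dict-based "file system" are illustrative assumptions.

task = {
    "instruction": "Rename report.txt to report_final.txt on the desktop.",
    # Initial state setup: seed the environment before the agent acts.
    "setup": lambda fs: fs.update({"Desktop/report.txt": "quarterly numbers"}),
}

def evaluate(fs: dict) -> bool:
    # Success is determined from the final environment state alone,
    # regardless of which action sequence the agent used to reach it.
    return (
        "Desktop/report_final.txt" in fs
        and "Desktop/report.txt" not in fs
    )

# Simulated run: the "agent" performs the rename on the toy file system.
fs = {}
task["setup"](fs)
fs["Desktop/report_final.txt"] = fs.pop("Desktop/report.txt")
print(evaluate(fs))  # True
```

Because the check inspects only the outcome, any valid strategy (GUI drag, terminal `mv`, file-manager dialog) would pass, which is what makes such tasks open-ended yet reliably scorable.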