OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
April 11, 2024
Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
cs.AI
Abstract
Autonomous agents that accomplish complex computer tasks with minimal human
interventions have the potential to transform human-computer interaction,
significantly enhancing accessibility and productivity. However, existing
benchmarks either lack an interactive environment or are limited to
environments specific to certain applications or domains, failing to reflect
the diverse and complex nature of real-world computer use, thereby limiting the
scope of tasks and agent scalability. To address this issue, we introduce
OSWorld, the first-of-its-kind scalable, real computer environment for
multimodal agents, supporting task setup, execution-based evaluation, and
interactive learning across various operating systems such as Ubuntu, Windows,
and macOS. OSWorld can serve as a unified, integrated computer environment for
assessing open-ended computer tasks that involve arbitrary applications.
Building upon OSWorld, we create a benchmark of 369 computer tasks involving
real web and desktop apps in open domains, OS file I/O, and workflows spanning
multiple applications. Each task example is derived from real-world computer
use cases and includes a detailed initial state setup configuration and a
custom execution-based evaluation script for reliable, reproducible evaluation.
Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld
reveals significant deficiencies in their ability to serve as computer
assistants. While humans can accomplish over 72.36% of the tasks, the best
model achieves only 12.24% success, primarily struggling with GUI grounding and
operational knowledge. Comprehensive analysis using OSWorld provides valuable
insights for developing multimodal generalist agents that were not possible
with previous benchmarks. Our code, environment, baseline models, and data are
publicly available at https://os-world.github.io.
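The abstract notes that each task pairs an initial state setup with a custom execution-based evaluation script, i.e., success is judged by the resulting environment state rather than by comparing the agent's actions to a gold trajectory. The following is a minimal, hypothetical sketch of that idea using a toy in-memory file system; the field names and `evaluate` function are illustrative assumptions, not the actual OSWorld task schema.

```python
# Hypothetical sketch of execution-based evaluation, in the spirit of the
# setup-config + evaluation-script pairing described in the abstract.
# All names and the dict-based "file system" are illustrative assumptions.

task = {
    "instruction": "Rename report.txt to report_final.txt on the desktop.",
    # Initial state setup: seed the environment before the agent acts.
    "setup": lambda fs: fs.update({"Desktop/report.txt": "quarterly numbers"}),
}

def evaluate(fs: dict) -> bool:
    # Success is determined from the final environment state alone,
    # regardless of which action sequence the agent used to reach it.
    return (
        "Desktop/report_final.txt" in fs
        and "Desktop/report.txt" not in fs
    )

# Simulated run: the "agent" performs the rename on the toy file system.
fs = {}
task["setup"](fs)
fs["Desktop/report_final.txt"] = fs.pop("Desktop/report.txt")
print(evaluate(fs))  # True
```

Because the check inspects only the outcome, any valid strategy (GUI drag, terminal `mv`, file-manager dialog) would pass, which is what makes such tasks open-ended yet reliably scorable.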