

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

April 11, 2024
Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
cs.AI

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
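The abstract describes each task example as an initial state setup configuration paired with a custom execution-based evaluation script, i.e. success is judged by inspecting the resulting environment state rather than the agent's action trace. The following is a minimal hypothetical sketch of that structure; the field names (`setup`, `evaluator`, `file_exists`, etc.) are illustrative assumptions, not the actual OSWorld task schema:

```python
# Hypothetical sketch of an OSWorld-style task example: a setup step that
# prepares the initial state, and an execution-based evaluator that checks
# the final environment state. Field names are assumptions for illustration.
import os
import tempfile

task = {
    "instruction": "Rename report.txt to report_final.txt",
    "setup": [{"type": "create_file", "path": "report.txt", "content": "draft"}],
    "evaluator": {"type": "file_exists", "path": "report_final.txt"},
}

def run_setup(task, workdir):
    """Apply the initial-state configuration inside a working directory."""
    for step in task["setup"]:
        if step["type"] == "create_file":
            with open(os.path.join(workdir, step["path"]), "w") as f:
                f.write(step["content"])

def evaluate(task, workdir):
    """Execution-based check: inspect the final state, return pass/fail."""
    ev = task["evaluator"]
    if ev["type"] == "file_exists":
        return os.path.exists(os.path.join(workdir, ev["path"]))
    return False

with tempfile.TemporaryDirectory() as wd:
    run_setup(task, wd)
    # A successful agent would perform the rename; simulate that here.
    os.rename(os.path.join(wd, "report.txt"),
              os.path.join(wd, "report_final.txt"))
    print(evaluate(task, wd))  # True
```

Because the evaluator only examines the final state, any sequence of GUI or shell actions that produces the required outcome passes, which is what makes this style of evaluation reproducible across different agents.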
