OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains. They therefore fail to reflect the diverse and complex nature of real-world computer use, which limits both the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
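The pairing of an initial state setup with an execution-based evaluation script can be illustrated with a minimal sketch. This is not OSWorld's actual task schema or API: the `Task` record, `run_task`, `success_rate`, and the toy file-renaming task are all hypothetical names chosen for illustration, and a temporary directory stands in for the real computer environment. The point it shows is that success is judged by inspecting the final state of the environment, not by comparing the agent's actions to a reference trace.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Sequence


@dataclass
class Task:
    """Hypothetical task record (an assumption, not OSWorld's schema)."""
    instruction: str
    setup: Callable[[Path], None]     # builds the initial state
    evaluate: Callable[[Path], bool]  # inspects the final state


Agent = Callable[[str, Path], None]


def run_task(task: Task, agent: Agent) -> bool:
    """Create a fresh state, let the agent act, then check the resulting state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.evaluate(workdir)


def success_rate(tasks: Sequence[Task], agent: Agent) -> float:
    """Fraction of tasks whose final state passes its evaluator."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


# Toy task: the directory stands in for a real desktop environment.
def setup_rename(workdir: Path) -> None:
    (workdir / "draft.txt").write_text("hello")


def check_rename(workdir: Path) -> bool:
    target = workdir / "final.txt"
    return (
        target.exists()
        and target.read_text() == "hello"
        and not (workdir / "draft.txt").exists()
    )


rename_task = Task("Rename draft.txt to final.txt", setup_rename, check_rename)


def renaming_agent(instruction: str, workdir: Path) -> None:
    """Stand-in agent that happens to solve the toy task."""
    (workdir / "draft.txt").rename(workdir / "final.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    """Stand-in agent that does nothing."""
```

Because every run starts from the same setup and is scored only on the state it leaves behind, any sequence of actions that reaches a correct final state counts as a success, which is one plausible reading of why this style of evaluation is described as reliable and reproducible.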
