ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline that combines automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
