ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
