ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline that combines automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
