OpenCUA: Open Foundations for Computer-Use Agents
August 12, 2025
Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu
cs.AI
Abstract
Vision-language models have demonstrated impressive capabilities as
computer-use agents (CUAs) capable of automating diverse computer tasks. As
their commercial potential grows, critical details of the most capable CUA
systems remain closed. As these agents will increasingly mediate digital
interactions and execute consequential decisions on our behalf, the research
community needs access to open CUA frameworks to study their capabilities,
limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive
open-source framework for scaling CUA data and foundation models. Our framework
consists of: (1) an annotation infrastructure that seamlessly captures human
computer-use demonstrations; (2) AgentNet, the first large-scale computer-use
task dataset spanning 3 operating systems and 200+ applications and websites;
(3) a scalable pipeline that transforms demonstrations into state-action pairs
with reflective long Chain-of-Thought reasoning that sustains robust performance
gains as data scales. Our end-to-end agent models demonstrate strong
performance across CUA benchmarks. In particular, OpenCUA-32B achieves an
average success rate of 34.8% on OSWorld-Verified, establishing a new
state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA
(GPT-4o). Further analysis confirms that our approach generalizes well across
domains and benefits significantly from increased test-time computation. We
release our annotation tool, datasets, code, and models to build open
foundations for further CUA research.