OpenCUA: Open Foundations for Computer-Use Agents

August 12, 2025
Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu
cs.AI

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning, which sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.