OpenCUA: Open Foundations for Computer-Use Agents

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) that can automate diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset, spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning and sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
