OpenCUA: Open Foundations for Computer-Use Agents

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset, spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
