OpenCUA: Open Foundations for Computer-Use Agents

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset, spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning, which sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
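
As a rough illustration of what converting a recorded demonstration into state-action pairs can look like, the sketch below pairs each logged action with the most recent screenshot captured before it. This is not the OpenCUA pipeline itself: the `Event` record, its fields, the action strings, and the function `to_state_action_pairs` are all hypothetical names chosen for this example, and the abstract does not specify how the actual pipeline aligns observations and actions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """Hypothetical log record; not the AgentNet schema."""
    t: float      # timestamp in seconds
    kind: str     # "screenshot" or "action"
    payload: str  # image path, or an action string such as "click(10, 20)"


def to_state_action_pairs(events: List[Event]) -> List[Tuple[str, str]]:
    """Pair each action with the latest screenshot recorded before it.

    Assumption for this sketch: the state is a single screenshot, and
    actions that occur before any screenshot exists are dropped.
    """
    pairs: List[Tuple[str, str]] = []
    last_screenshot: Optional[str] = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "screenshot":
            last_screenshot = ev.payload
        elif ev.kind == "action" and last_screenshot is not None:
            pairs.append((last_screenshot, ev.payload))
    return pairs


if __name__ == "__main__":
    demo = [
        Event(0.0, "screenshot", "s0.png"),
        Event(0.5, "action", "click(10, 20)"),
        Event(1.0, "screenshot", "s1.png"),
        Event(1.2, "action", "type('hi')"),
    ]
    for state, action in to_state_action_pairs(demo):
        print(state, "->", action)
```

A real pipeline would presumably also need to merge low-level input events into higher-level actions and attach the reasoning text mentioned in the abstract; those steps are outside the scope of this sketch.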