UFO2: The Desktop AgentOS

Abstract

Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents, each equipped with native APIs, domain-specific knowledge, and a unified GUI-API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Speculative multi-action planning further improves runtime efficiency by reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface runs automation inside an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 on more than 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
