UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Abstract

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, on the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). On AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
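To make the idea of a unified action space concrete, the following is a minimal illustrative sketch, not the UI-TARS implementation. The action names, the platform aliases, and the `to_unified` helper are all assumptions introduced here for illustration; the abstract does not specify the actual action vocabulary.

```python
from dataclasses import dataclass, field

# Hypothetical unified vocabulary; the real UI-TARS action space is not
# given in the abstract, so these names are illustrative only.
UNIFIED_ACTIONS = {"click", "type", "scroll"}

# Hypothetical per-platform aliases mapped onto the unified names.
PLATFORM_ALIASES = {
    "desktop": {"left_click": "click", "keyboard_input": "type", "wheel": "scroll"},
    "mobile": {"tap": "click", "input_text": "type", "swipe": "scroll"},
}


@dataclass
class Action:
    """A platform-independent action: a unified name plus its arguments."""
    name: str
    args: dict = field(default_factory=dict)


def to_unified(platform: str, raw_name: str, **args) -> Action:
    """Translate a platform-specific action name into the unified space."""
    aliases = PLATFORM_ALIASES[platform]
    # Fall back to the raw name so already-unified names pass through.
    name = aliases.get(raw_name, raw_name)
    if name not in UNIFIED_ACTIONS:
        raise ValueError(f"unsupported action {raw_name!r} on {platform!r}")
    return Action(name, args)
```

Under these assumptions, a mobile `tap` and a desktop `left_click` at the same coordinates map to the same `Action`, which is the property that would let action traces from different platforms be pooled for training.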
