UItron: Foundational GUI Agent with Advanced Perception and Planning

Abstract

GUI agents aim to automate operations on mobile and PC devices, an important task on the path toward artificial general intelligence. Rapid advances in vision-language models (VLMs) have accelerated the development of GUI agents, owing to their strong capabilities in visual understanding and task planning. However, building a GUI agent remains challenging because operation trajectories are scarce, interactive infrastructure is not readily available, and foundation models have limited initial capabilities. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to improve training effectiveness, but also establishes an interactive environment connecting both mobile and PC devices. In training, UItron adopts supervised fine-tuning on perception and planning tasks across various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese-language capability even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, bringing GUI agents one step closer to real-world application.
