UItron: Foundational GUI Agent with Advanced Perception and Planning

Abstract

GUI agents aim to automate operations on mobile and PC devices, an important step toward artificial general intelligence. The rapid advancement of vision-language models (VLMs) has accelerated the development of GUI agents, owing to their strong capabilities in visual understanding and task planning. However, building a GUI agent remains challenging because of the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to improve training effectiveness, but also establishes an interactive environment connecting both mobile and PC devices. In training, UItron first applies supervised fine-tuning on perception and planning tasks across various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
