UItron: Foundational GUI Agent with Advanced Perception and Planning
August 29, 2025
Authors: Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma
cs.AI
Abstract
GUI agents aim to enable automated operations on mobile/PC devices, which is
an important step toward achieving artificial general intelligence. The rapid
advancement of vision-language models (VLMs) accelerates the development of GUI
agents, owing to their powerful capabilities in visual understanding and task
planning. However,
building a GUI agent remains challenging due to the scarcity of
operation trajectories, the limited availability of interactive infrastructure,
and the limited initial capabilities of foundation models. In this work, we
introduce UItron, an open-source foundational model for automatic GUI agents,
featuring advanced GUI perception, grounding, and planning capabilities. UItron
highlights the necessity of systematic data engineering and interactive
infrastructure as foundational components for advancing GUI agent development.
It not only systematically studies a series of data engineering strategies to
enhance training effects, but also establishes an interactive environment
connecting both mobile and PC devices. In training, UItron adopts supervised
fine-tuning over perception and planning tasks in various GUI scenarios, and
then develops a curriculum reinforcement learning framework to enable complex
reasoning and exploration in online environments. As a result, UItron achieves
superior performance in benchmarks of GUI perception, grounding, and planning.
In particular, UItron emphasizes proficiency in interacting with top-tier
Chinese mobile apps, as we identified a general lack of Chinese capabilities
even in state-of-the-art solutions. To this end, we manually collect over one
million steps of operation trajectories across the top 100 most popular apps,
and build offline and online agent evaluation environments. Experimental
results demonstrate that UItron achieves significant progress in Chinese app
scenarios, propelling GUI agents one step closer to real-world application.