UItron: Foundational GUI Agent with Advanced Perception and Planning

Abstract

GUI agents aim to automate operations on mobile and PC devices, an important task on the path toward artificial general intelligence. The rapid advancement of vision-language models (VLMs) has accelerated the development of GUI agents, owing to their strong capabilities in visual understanding and task planning. However, building a GUI agent remains challenging because operation trajectories are scarce, interactive infrastructure is not readily available, and foundation models have limited initial capabilities. In this work, we introduce UItron, an open-source foundation model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It systematically studies a series of data engineering strategies to improve training effectiveness, and it also establishes an interactive environment connecting both mobile and PC devices. In training, UItron first applies supervised fine-tuning on perception and planning tasks across various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes proficient interaction with top-tier Chinese mobile apps, since we found that even state-of-the-art solutions generally lack Chinese-language capabilities. To this end, we manually collect over one million steps of operation trajectories across the 100 most popular apps, and we build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, bringing GUI agents one step closer to real-world application.
