

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

December 2, 2025
Authors: Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang
cs.AI

Abstract

With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent research has shifted from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address it, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These results highlight the advantages of reinforcement learning approaches for GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
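The abstract does not include any implementation details, so the following is only a rough, hypothetical Python sketch of what a screen/icon/navigation-graph simulation environment of the kind described might look like, with a multi-turn episode where an agent taps icons to reach a target screen. All class, method, and screen names here are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch only: a minimal navigation-graph environment in the spirit
# of the abstract (screens, icons, navigation graph, full state access).
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import random


@dataclass
class Screen:
    name: str
    # icon label -> name of the screen it navigates to
    icons: Dict[str, str] = field(default_factory=dict)


class NavigationEnv:
    """Multi-turn episode: the agent taps icons to reach a target screen."""

    def __init__(self, screens: List[Screen], max_turns: int = 10):
        self.graph = {s.name: s for s in screens}
        self.max_turns = max_turns

    def reset(self, start: str, target: str) -> Tuple[str, List[str]]:
        self.current, self.target, self.turn = start, target, 0
        return self.current, list(self.graph[self.current].icons)

    def step(self, icon: str):
        """Tap an icon; returns ((screen, visible icons), reward, done)."""
        self.turn += 1
        screen = self.graph[self.current]
        if icon in screen.icons:          # invalid taps leave the screen unchanged
            self.current = screen.icons[icon]
        reached = self.current == self.target
        done = reached or self.turn >= self.max_turns
        reward = 1.0 if reached else 0.0
        return (self.current, list(self.graph[self.current].icons)), reward, done


# Usage: compose a small navigation graph and roll out a random policy.
env = NavigationEnv([
    Screen("home", {"settings": "settings", "mail": "inbox"}),
    Screen("settings", {"back": "home", "wifi": "wifi"}),
    Screen("inbox", {"back": "home"}),
    Screen("wifi", {"back": "settings"}),
])
obs, icons = env.reset(start="home", target="wifi")
done = False
while not done:
    (obs, icons), reward, done = env.step(random.choice(icons))
```

In such a setup, the same graph definition could serve supervised fine-tuning (memorizing known paths), single-turn RL (single-step decisions), and multi-turn RL (trial-and-error exploration over whole episodes), which is the training progression the abstract describes.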