UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
May 27, 2025
Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
cs.AI
Abstract
In this paper, we introduce UI-Genie, a self-improving framework that addresses
two key challenges in GUI agents: trajectory outcomes are difficult to verify,
and high-quality training data are hard to scale. These challenges
are addressed by a reward model and a self-improving pipeline, respectively.
The reward model, UI-Genie-RM, features an image-text interleaved architecture
that efficiently processes historical context and unifies action-level and
task-level rewards. To support the training of UI-Genie-RM, we develop
deliberately designed data generation strategies, including rule-based
verification, controlled trajectory corruption, and hard negative mining. To
address the second challenge, a self-improvement pipeline progressively expands
the set of solvable complex GUI tasks by enhancing both the agent and reward models
through reward-guided exploration and outcome verification in dynamic
environments. For model training, we generate UI-Genie-RM-517k and
UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI
agents while demonstrating high-quality synthetic trajectory generation
without manual annotation. Experimental results show that UI-Genie achieves
state-of-the-art performance across multiple GUI agent benchmarks with three
generations of data-model self-improvement. We open-source our complete
framework implementation and generated datasets to facilitate further research
at https://github.com/Euphoria16/UI-Genie.