LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
April 28, 2025
Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li
cs.AI
Abstract
With the rapid rise of large language models (LLMs), phone automation has
undergone transformative changes. This paper systematically reviews LLM-driven
phone GUI agents, highlighting their evolution from script-based automation to
intelligent, adaptive systems. We first contextualize key challenges, (i)
limited generality, (ii) high maintenance overhead, and (iii) weak intent
comprehension, and show how LLMs address these issues through advanced language
understanding, multimodal perception, and robust decision-making. We then
propose a taxonomy covering fundamental agent frameworks (single-agent,
multi-agent, plan-then-act), modeling approaches (prompt engineering,
training-based), and essential datasets and benchmarks. Furthermore, we detail
task-specific architectures, supervised fine-tuning, and reinforcement learning
strategies that bridge user intent and GUI operations. Finally, we discuss open
challenges such as dataset diversity, on-device deployment efficiency,
user-centric adaptation, and security concerns, offering forward-looking
insights into this rapidly evolving field. By providing a structured overview
and identifying pressing research gaps, this paper serves as a definitive
reference for researchers and practitioners seeking to harness LLMs in
designing scalable, user-friendly phone GUI agents.