Mobile GUI Agents under Real-world Threats: Are We There Yet?
April 14, 2026
Authors: Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
cs.AI
Abstract
Recent years have witnessed rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and adopted by early users. However, are we really ready for GUI agents to be integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation step is missing: examining whether agents can maintain their performance under real-world threats. Unlike existing common benchmarks, which are built on simple static app content (a necessity to ensure environment consistency across tests), real-world apps are filled with content from untrustworthy third parties, such as advertisement emails and user-generated posts and media. To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We evaluate both open-source and commercial GUI agents. Our findings reveal that all examined agents can be significantly degraded by third-party content, with average misleading rates of 42.0% and 36.1% in dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.
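The abstract reports an average "misleading rate" over the evaluated tasks. The paper does not define the metric's implementation here; the sketch below shows one plausible reading, assuming a misled trial is one where the agent acted on injected third-party content (all names, `Trial` and `misleading_rate`, are hypothetical, not from the released framework):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One task execution under content injection (hypothetical record format)."""
    task_id: str
    misled: bool  # True if the agent acted on the injected third-party content

def misleading_rate(trials: list[Trial]) -> float:
    """Fraction of trials in which the agent was misled; 0.0 for an empty run."""
    if not trials:
        return 0.0
    return sum(t.misled for t in trials) / len(trials)

# Toy example: 3 of 5 trials misled -> rate of 0.6
trials = [
    Trial("t1", True), Trial("t2", False), Trial("t3", True),
    Trial("t4", False), Trial("t5", True),
]
print(misleading_rate(trials))  # → 0.6
```

Under this reading, the reported 42.0% would mean that, on average, injected content diverted the agent in 42 of every 100 dynamic-environment trials.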