Mobile GUI Agents under Real-world Threats: Are We There Yet?
April 14, 2026
Authors: Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
cs.AI
Abstract
Recent years have witnessed rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and adopted by early users. However, are we really ready to integrate GUI agents into our daily devices as system building blocks? We argue that a critical pre-deployment validation step is missing: examining whether agents can maintain their performance under real-world threats. Specifically, unlike existing benchmarks, which rely on simple static app content (a necessity for ensuring environment consistency across tests), real-world apps are filled with content from untrustworthy third parties, such as advertisement emails, user-generated posts and media, etc. ... To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that the performance of all examined agents degrades significantly in the presence of third-party content, with average misleading rates of 42.0% and 36.1% in the dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.
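To make the headline metric concrete, the following is a minimal sketch of one plausible way to compute a "misleading rate": the fraction of cases an agent handles correctly on the clean GUI state but fails once untrusted third-party content is injected. The function name, result schema, and this exact definition are illustrative assumptions, not taken from the released benchmark.

```python
# Hypothetical sketch of a misleading-rate computation.
# Assumption: each result records whether the agent acted correctly on
# the clean app state ('clean_correct') and on the same state with
# injected third-party content ('threat_correct').

def misleading_rate(results):
    """Fraction of clean-correct cases that fail under injected content."""
    # Only cases the agent solves in the clean setting can be "misled".
    eligible = [r for r in results if r["clean_correct"]]
    if not eligible:
        return 0.0
    misled = sum(1 for r in eligible if not r["threat_correct"])
    return misled / len(eligible)

# Toy example: 3 of 5 clean-correct cases fail once content is injected.
sample = [
    {"clean_correct": True, "threat_correct": False},
    {"clean_correct": True, "threat_correct": True},
    {"clean_correct": True, "threat_correct": False},
    {"clean_correct": True, "threat_correct": False},
    {"clean_correct": True, "threat_correct": True},
]
print(misleading_rate(sample))  # 0.6
```

A per-case pairing like this (clean run vs. threat run on the same task) is what makes the metric attributable to the injected content rather than to the agent's baseline error rate.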