AppAgent: Multimodal Agents as Smartphone Users

Abstract

Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
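The simplified action space and knowledge base described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the framework's actual interface: the names `Tap`, `Swipe`, `KnowledgeBase`, and `parse_action`, the numeric element labels, and the `tap(3)` / `swipe(2, up)` string format are all hypothetical and chosen here for clarity.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Tap:
    # Hypothetical numeric label of an on-screen UI element.
    element: int


@dataclass(frozen=True)
class Swipe:
    element: int
    direction: str  # one of "up", "down", "left", "right"


# The whole (assumed) action space: just the two human-like gestures.
Action = Union[Tap, Swipe]

VALID_DIRECTIONS = {"up", "down", "left", "right"}


@dataclass
class KnowledgeBase:
    """Notes about UI elements, gathered by exploration or demonstration."""

    notes: Dict[str, str] = field(default_factory=dict)

    def record(self, element_id: str, note: str) -> None:
        self.notes[element_id] = note

    def lookup(self, element_id: str) -> str:
        # Unknown elements yield an empty note rather than an error.
        return self.notes.get(element_id, "")


def parse_action(text: str) -> Action:
    """Parse a string such as 'tap(3)' or "swipe(2, 'up')" into an Action."""
    name, _, rest = text.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"malformed action: {text!r}")
    # Split the argument list and drop surrounding whitespace and quotes.
    args = [a.strip().strip("'\"") for a in rest[:-1].split(",") if a.strip()]
    if name == "tap" and len(args) == 1:
        return Tap(int(args[0]))
    if name == "swipe" and len(args) == 2:
        if args[1] not in VALID_DIRECTIONS:
            raise ValueError(f"unknown swipe direction: {args[1]!r}")
        return Swipe(int(args[0]), args[1])
    raise ValueError(f"unsupported action: {text!r}")
```

In such a design, a model's text output would be parsed into one of these actions and then executed on the device, while notes recorded during exploration or demonstration would be looked up when the agent later encounters the same element. Keeping the action space this small is what lets the agent work without back-end access to any particular app.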
