AppAgent: Multimodal Agents as Smartphone Users

Abstract

Recent advances in large language models (LLMs) have led to intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. The framework lets the agent operate apps through a simplified action space that mimics human interactions such as tapping and swiping. Because this approach needs no access to the system back end, it applies across a broad range of apps. Central to the agent's functionality is its learning method: the agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent consults when executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
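The abstract does not specify how the action space or the knowledge base is represented. As a minimal sketch only, assuming actions target numbered on-screen elements and the knowledge base stores one free-text note per element, the two ideas could be modeled as follows. Every name here (`Tap`, `Swipe`, `KnowledgeBase`, `describe`) is hypothetical and is not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Tap:
    """Hypothetical action: tap the UI element with this numeric label."""
    element: int


@dataclass(frozen=True)
class Swipe:
    """Hypothetical action: swipe on an element in a given direction."""
    element: int
    direction: str  # assumed to be "up", "down", "left", or "right"


# The simplified action space is the union of the allowed action types.
Action = Union[Tap, Swipe]


@dataclass
class KnowledgeBase:
    """Assumed layout: (app name, element label) -> note on what it does."""
    notes: Dict[Tuple[str, int], str] = field(default_factory=dict)

    def record(self, app: str, element: int, note: str) -> None:
        # Called during exploration or while watching a demonstration.
        self.notes[(app, element)] = note

    def lookup(self, app: str, element: int) -> str:
        # Called at task time; falls back when nothing was learned.
        return self.notes.get((app, element), "unknown element")


def describe(action: Action, app: str, kb: KnowledgeBase) -> str:
    """Render an action together with what the knowledge base says about it."""
    note = kb.lookup(app, action.element)
    if isinstance(action, Tap):
        return f"tap element {action.element} ({note})"
    return f"swipe {action.direction} on element {action.element} ({note})"
```

In this sketch a note recorded for an element in the learning phase is reused when the agent later plans an action in the same app, which mirrors the explore-then-deploy split described in the abstract without claiming to reproduce the authors' implementation.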
