Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

January 20, 2025
Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
cs.AI

Abstract

Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
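
The division of labor described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the `;`-based task splitting, the dictionary used as a fake screen, and the shape of `Shortcut` and `LongTermMemory` are assumptions made purely for illustration, and the real agents are LMM-driven rather than rule-based. The sketch only shows how a Manager-level plan, per-subgoal perception, operation, reflection and note-taking, and a persistent memory of Tips and Shortcuts might fit together.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Shortcut:
    """A reusable, named sequence of atomic operations (hypothetical shape)."""
    name: str
    operations: list[str]


@dataclass
class LongTermMemory:
    """Persistent memory holding Tips and Shortcuts across tasks."""
    tips: list[str] = field(default_factory=list)
    shortcuts: dict[str, Shortcut] = field(default_factory=dict)


def manager(task: str) -> list[str]:
    """High-level planning: split the task into subgoals (toy rule: ';')."""
    return [part.strip() for part in task.split(";") if part.strip()]


def perceptor(screen: dict) -> str:
    """Visual perception, stubbed as reading the current app name."""
    return f"app={screen['app']}"


def operator(subgoal: str, memory: LongTermMemory) -> list[str]:
    """Low-level execution: reuse a matching Shortcut, else one atomic op."""
    shortcut = memory.shortcuts.get(subgoal)
    return list(shortcut.operations) if shortcut else [f"tap:{subgoal}"]


def action_reflector(before: str, after: str) -> bool:
    """Error verification, stubbed as 'did the perceived state change?'."""
    return before != after


def run_task(task: str, memory: LongTermMemory) -> dict:
    screen = {"app": "home"}  # assumed toy environment
    notes: list[str] = []
    executed: list[str] = []
    for subgoal in manager(task):
        before = perceptor(screen)
        executed.extend(operator(subgoal, memory))
        screen["app"] = subgoal  # toy assumption: the action always takes effect
        ok = action_reflector(before, perceptor(screen))
        # Notetaker role: aggregate per-subgoal outcomes.
        notes.append(f"{subgoal}: {'ok' if ok else 'failed'}")
    # Self-evolution, reduced here to recording one Tip after the task.
    memory.tips.append(f"completed {len(notes)} subgoals for: {task}")
    return {"notes": notes, "operations": executed}


memory = LongTermMemory()
memory.shortcuts["search"] = Shortcut(
    "search", ["tap:search_box", "type:query", "tap:enter"]
)
result = run_task("open maps; search", memory)
print(result["operations"])
```

In this sketch the `search` subgoal expands into the three stored atomic operations instead of a single tap, which is the efficiency benefit the abstract attributes to Shortcuts; because `memory` outlives the call to `run_task`, later tasks could reuse what earlier ones recorded.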

