Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

January 20, 2025
Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
cs.AI

Abstract

Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
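
The division of labor described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: every name here (`Shortcut`, `Memory`, `manager_plan`, `operator_act`, `run`) and the string encoding of atomic operations are hypothetical. The Perceptor and the underlying LMM calls are omitted, and a simple callback stands in for the device.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Shortcut:
    """Hypothetical Shortcut: a reusable sequence of atomic operations."""
    name: str
    steps: List[str]  # templates such as "type:{arg}"

    def expand(self, **args: str) -> List[str]:
        # Fill the templates to obtain concrete atomic operations.
        return [step.format(**args) for step in self.steps]


@dataclass
class Memory:
    """Hypothetical persistent long-term memory holding Tips and Shortcuts."""
    tips: List[str] = field(default_factory=list)
    shortcuts: Dict[str, Shortcut] = field(default_factory=dict)


def manager_plan(task: str) -> List[str]:
    # Stand-in for the Manager: split the task into subgoals on " then ".
    return [part.strip() for part in task.split(" then ")]


def operator_act(subgoal: str, memory: Memory) -> List[str]:
    # Stand-in for the Operator: reuse a Shortcut keyed by the first word
    # of the subgoal if one exists, else emit a single atomic operation.
    verb, _, arg = subgoal.partition(" ")
    shortcut = memory.shortcuts.get(verb)
    if shortcut is not None:
        return shortcut.expand(arg=arg)
    return [f"atomic:{subgoal}"]


def run(task: str, memory: Memory, device: Callable[[str], bool]) -> List[str]:
    notes: List[str] = []  # Notetaker role: aggregate what was done
    for subgoal in manager_plan(task):
        for op in operator_act(subgoal, memory):
            succeeded = device(op)  # Action Reflector role: verify outcome
            if not succeeded:
                # Self-evolution role: record a Tip for future tasks.
                memory.tips.append(f"'{op}' failed during '{subgoal}'")
                break
            notes.append(op)
    return notes


memory = Memory(shortcuts={
    "search": Shortcut("search", ["tap:search_bar", "type:{arg}", "press:enter"]),
})
# The fake device rejects one operation so that a Tip gets recorded.
log = run("search running shoes then open cart", memory,
          lambda op: op != "atomic:open cart")
```

In this sketch the high-level plan never touches atomic operations directly, which mirrors the stated separation of planning from execution, and the memory object outlives a single call to `run`, so Tips and Shortcuts accumulated on one task are available to the next.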
