Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
June 3, 2024
Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
cs.AI
Abstract
Mobile device operation tasks are becoming an increasingly popular multi-modal
AI application scenario. Current Multi-modal Large Language Models (MLLMs),
constrained by their training data, lack the capability to function effectively
as operation assistants. Instead, MLLM-based agents, which enhance capabilities
through tool invocation, are gradually being applied to this scenario. However,
the two major navigation challenges in mobile device operation tasks, task
progress navigation and focus content navigation, are significantly complicated
under the single-agent architecture of existing work. This is due to the overly
long token sequences and the interleaved text-image data format, which limit
performance. To address these navigation challenges effectively, we propose
Mobile-Agent-v2, a multi-agent architecture for mobile device operation
assistance. The architecture comprises three agents: a planning agent, a decision
agent, and a reflection agent. The planning agent generates task progress, making
the navigation of history operations more efficient. To retain focus content,
we design a memory unit that updates with task progress. Additionally, to
correct erroneous operations, the reflection agent observes the outcomes of
each operation and handles any mistakes accordingly. Experimental results
indicate that Mobile-Agent-v2 achieves an improvement of over 30% in task
completion compared to the single-agent architecture of Mobile-Agent. The code
is open-sourced at https://github.com/X-PLUG/MobileAgent.
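
The division of labor among the three agents can be illustrated with a minimal Python sketch. It is not taken from the Mobile-Agent-v2 code base. Every name in it (AgentState, run_episode, the toy_* functions, make_flaky_act) is hypothetical, the "screen" is reduced to an integer page index, and the agents are plain functions standing in for MLLM calls. The sketch only shows one plausible control flow: the planning agent condenses the operation history into a short progress summary, the decision agent picks the next operation from that summary and the memory unit, and the reflection agent checks the outcome so that an erroneous operation is discarded rather than recorded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    progress: str = ""                                # summary written by the planning agent
    memory: List[str] = field(default_factory=list)   # focus content kept across steps
    history: List[str] = field(default_factory=list)  # operations accepted so far


def run_episode(goal, screen, planner, decider, reflector, act, max_steps=10):
    """Hypothetical control loop; all callables are stand-ins for model calls."""
    state = AgentState()
    for _ in range(max_steps):
        # Planning agent: compress the history into a short progress summary.
        state.progress = planner(state.history)
        # Decision agent: choose the next operation from progress, memory, screen.
        action = decider(goal, state.progress, state.memory, screen)
        if action == "stop":
            break
        new_screen = act(screen, action)
        # Reflection agent: accept the operation only if the outcome looks right.
        if reflector(screen, new_screen, action):
            state.history.append(action)
            state.memory.append(f"screen {new_screen}")
            screen = new_screen
        # Otherwise the result is discarded and the step is retried.
    return screen, state


def toy_planner(history):
    return f"{len(history)} step(s) completed"


def toy_decider(goal, progress, memory, screen):
    return "stop" if screen >= goal else "tap_next"


def toy_reflector(before, after, action):
    # In this toy world a correct "tap_next" advances exactly one page.
    return after == before + 1


def make_flaky_act():
    calls = {"n": 0}

    def act(screen, action):
        calls["n"] += 1
        # The first tap lands on a wrong page; later taps behave correctly.
        return screen + 10 if calls["n"] == 1 else screen + 1

    return act


final_screen, final_state = run_episode(
    goal=3, screen=0, planner=toy_planner, decider=toy_decider,
    reflector=toy_reflector, act=make_flaky_act(),
)
```

In this toy run the first tap misfires, the reflection check rejects it, and the loop then reaches page 3 with three accepted operations in the history. The decision step sees only the short progress string and the memory list, not the full interleaved history, which mirrors the motivation given in the abstract for splitting the work across agents.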