Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

June 3, 2024
Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
cs.AI

Abstract

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
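
The division of labor described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `ToyDevice`, `decision_agent`, `planning_agent`, `reflection_agent`, and `run` are hypothetical names, the agents are plain rule-based functions standing in for MLLM calls, and the "phone" is a three-screen simulator whose first operation is silently dropped so that the reflection step has an error to catch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyDevice:
    """Hypothetical stand-in for a phone; its first operation has no effect."""
    screen: str = "home"
    sent_text: str = ""
    dropped_once: bool = False

    def execute(self, action: str) -> None:
        if not self.dropped_once:
            self.dropped_once = True  # simulate one erroneous/ineffective operation
            return
        verb, _, arg = action.partition(" ")
        transitions = {
            ("home", "open_notes"): "notes",
            ("notes", "open_chat"): "chat",
            ("chat", "type_and_send"): "sent",
        }
        if (self.screen, verb) == ("chat", "type_and_send"):
            self.sent_text = arg
        self.screen = transitions.get((self.screen, verb), self.screen)


def decision_agent(screen: str, progress: str,
                   memory: List[str]) -> Tuple[str, Optional[str]]:
    """Pick the next operation and any focus content worth remembering.

    A real agent would put `progress` and `memory` into an MLLM prompt;
    this toy policy only needs the current screen and the memory.
    """
    if screen == "home":
        return "open_notes", None
    if screen == "notes":
        return "open_chat", "code 1234"  # focus content visible on this screen
    if screen == "chat":
        return "type_and_send " + (memory[-1] if memory else ""), None
    return "stop", None


def planning_agent(progress: str, action: str) -> str:
    """Fold the latest successful operation into a short text-only progress summary."""
    return f"{progress}; {action}" if progress else action


def reflection_agent(before: str, after: str) -> str:
    """Judge an operation by comparing the screen before and after it."""
    return "ok" if before != after else "ineffective"


def run(device: ToyDevice, max_steps: int = 10):
    progress, memory, log = "", [], []
    for _ in range(max_steps):
        action, focus = decision_agent(device.screen, progress, memory)
        if action == "stop":
            break
        before = device.screen
        device.execute(action)
        verdict = reflection_agent(before, device.screen)
        log.append((action, verdict))
        if verdict == "ok":  # only successful operations update progress and memory
            if focus:
                memory.append(focus)
            progress = planning_agent(progress, action)
    return progress, memory, log
```

Calling `run(ToyDevice())` retries the dropped `open_notes` operation after the reflection step flags it as ineffective, carries the remembered focus content from the notes screen to the chat screen through the memory list, and returns a compact progress string in place of a full interleaved history.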
