
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

June 3, 2024
Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
cs.AI

Abstract

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
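
The division of labor described in the abstract can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: every name in it (TaskState, run_task, FakeDevice, and the plan, decide, reflect, and remember stubs) is hypothetical, the agents are plain Python functions standing in for MLLM calls, and the rule "an operation that the reflection agent rejects is simply not recorded, so it gets retried" is an illustrative simplification rather than something the abstract specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TaskState:
    """Hypothetical shared state passed between the three agents."""
    instruction: str
    progress: str = ""                                # text summary from the planning agent
    memory: List[str] = field(default_factory=list)   # focus content kept across screens
    history: List[str] = field(default_factory=list)  # operations judged successful


def run_task(
    state: TaskState,
    observe: Callable[[], str],
    act: Callable[[str], None],
    plan: Callable[[TaskState], str],
    decide: Callable[[TaskState, str], str],
    reflect: Callable[[TaskState, str, str, str], bool],
    remember: Callable[[TaskState, str], Optional[str]],
    max_steps: int = 10,
) -> TaskState:
    """One plan -> decide -> act -> reflect round per step (illustrative only)."""
    for _ in range(max_steps):
        before = observe()
        # Planning agent: condense the operation history into short task progress.
        state.progress = plan(state)
        # Decision agent: choose the next operation from progress, memory, and screen.
        action = decide(state, before)
        if action == "stop":
            break
        act(action)
        after = observe()
        # Reflection agent: compare screens before and after the operation.
        if reflect(state, action, before, after):
            state.history.append(action)
            note = remember(state, after)  # memory unit: keep focus content, if any
            if note is not None:
                state.memory.append(note)
        # Assumption: a rejected operation is not recorded, so it is retried.
    return state


class FakeDevice:
    """Toy stand-in for a phone; it ignores the first tap to simulate a failed operation."""

    def __init__(self) -> None:
        self.screen = "home"
        self.taps = 0

    def observe(self) -> str:
        return self.screen

    def act(self, action: str) -> None:
        self.taps += 1
        if action == "open_weather" and self.taps > 1:
            self.screen = "weather: 21C"


def plan(state: TaskState) -> str:
    return ("done: " + ", ".join(state.history)) if state.history else "nothing done yet"


def decide(state: TaskState, screen: str) -> str:
    # Stop once the focus content has been captured in memory.
    return "stop" if state.memory else "open_weather"


def reflect(state: TaskState, action: str, before: str, after: str) -> bool:
    # Treat an unchanged screen as a failed operation.
    return before != after


def remember(state: TaskState, screen: str) -> Optional[str]:
    return screen if screen.startswith("weather") else None


device = FakeDevice()
final = run_task(
    TaskState("Check today's weather"),
    device.observe, device.act, plan, decide, reflect, remember,
)
print(final.progress, final.memory, device.taps)
```

In this toy run the first tap has no effect, the reflection stub rejects it, and the decision stub issues the same operation again. The point of the sketch is that the decision step reads only the short progress string and the memory list, not the full interleaved history of screens and operations.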