Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Abstract

Mobile device operation is becoming an increasingly popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, cannot function effectively as operation assistants. Instead, MLLM-based agents, which extend their capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, become significantly harder under the single-agent architecture of existing work, because overly long token sequences and the interleaved text-image data format limit performance. To address these navigation challenges, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent generates task progress, making navigation of the operation history more efficient. To retain focus content, we design a memory unit that is updated as the task progresses. In addition, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 improves task completion by over 30% compared with the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
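The division of labor among the three agents can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `State` layout, the `run_episode` function, the callable signatures, the verdict strings (`"ok"`, `"wrong_page"`), the `"back"` and `"stop"` actions, and the rule that the decision step supplies the note stored in memory are all assumptions made for illustration. Each agent is passed in as a plain callable, so an MLLM call could be substituted for any of them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    """Shared text-only state handed between agents (hypothetical layout)."""
    instruction: str
    progress: str = ""                                # summary written by the planning agent
    memory: List[str] = field(default_factory=list)   # focus content kept for later steps
    history: List[str] = field(default_factory=list)  # operations judged correct


def run_episode(
    instruction: str,
    plan: Callable[[State], str],
    decide: Callable[[State, str], Tuple[str, Optional[str]]],
    reflect: Callable[[State, str, str, str], str],
    execute: Callable[[str], None],
    observe: Callable[[], str],
    max_steps: int = 10,
) -> State:
    """Run one plan -> decide -> act -> reflect loop (illustrative only)."""
    state = State(instruction)
    screen = observe()
    for _ in range(max_steps):
        # Planning agent: compress the operation history into short task progress.
        state.progress = plan(state)
        # Decision agent: pick the next operation and, optionally, a note to remember.
        action, note = decide(state, screen)
        if action == "stop":
            break
        if note is not None:
            state.memory.append(note)
        execute(action)
        new_screen = observe()
        # Reflection agent: compare the screens before and after the operation.
        verdict = reflect(state, screen, new_screen, action)
        if verdict == "ok":
            state.history.append(action)
            screen = new_screen
        elif verdict == "wrong_page":
            execute("back")        # assumed recovery: undo, then decide again
            screen = observe()
        else:
            screen = new_screen    # e.g. no visible effect: record nothing, retry
    return state
```

In this sketch the decision step sees only the short `progress` string and the `memory` list rather than the full interleaved text-image history, which is the motivation the abstract gives for splitting the roles.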
