Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Abstract

Mobile device operation tasks are becoming an increasingly popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, cannot function effectively as operation assistants. Instead, MLLM-based agents, which extend their capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, become significantly harder under the single-agent architecture of existing work, because overly long token sequences and the interleaved text-image data format limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
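The division of labor among the three agents and the memory unit can be illustrated with a small control loop. The following is a minimal sketch under stated assumptions, not the implementation in the linked repository: the names `AgentState`, `run_task`, and the `toy_*` functions are hypothetical, each agent is a plain Python stub standing in for an MLLM call, the device is a toy lookup table, and the rule that a failed operation is simply dropped and the previous screen kept is an assumed way of "handling mistakes".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    instruction: str
    progress: List[str] = field(default_factory=list)  # compact, text-only task progress
    memory: List[str] = field(default_factory=list)    # focus content carried across screens
    last_operation: str = ""                           # most recent accepted operation


def run_task(state, screen, env_step, plan, decide, reflect, max_steps=10):
    """Plan -> decide -> act -> reflect loop (hypothetical control flow)."""
    for _ in range(max_steps):
        state.progress = plan(state)               # planning agent: fold history into progress
        operation, note = decide(state, screen)    # decision agent: choose the next operation
        if operation == "stop":
            break
        new_screen = env_step(screen, operation)   # execute on the toy device
        if reflect(screen, new_screen, operation): # reflection agent: did the operation work?
            state.last_operation = operation
            if note:
                state.memory.append(note)          # memory unit: keep focus content
            screen = new_screen
        else:
            # Assumed error handling: do not record the operation, keep the old screen.
            state.last_operation = ""
    return state, screen


# Toy stand-ins for the device and the three agents (all hypothetical).
TRANSITIONS = {("home", "open Notes"): "notes", ("notes", "save note"): "saved"}


def toy_env_step(screen, operation):
    # Unknown operations leave the screen unchanged.
    return TRANSITIONS.get((screen, operation), screen)


def toy_plan(state):
    if state.last_operation:
        return state.progress + [state.last_operation]
    return state.progress


def toy_decide(state, screen):
    if screen == "home":
        return "open Notes", ""
    if screen == "notes":
        return "save note", "note title: groceries"
    return "stop", ""


def toy_reflect(before, after, operation):
    # Treat "the screen did not change" as a failed operation.
    return before != after


final_state, final_screen = run_task(
    AgentState("Save a grocery note"), "home",
    toy_env_step, toy_plan, toy_decide, toy_reflect,
)
print(final_state.progress, final_state.memory, final_screen)
```

The point of the sketch is that the decision step only sees the short `progress` summary and the `memory` list rather than the full interleaved history of screenshots and operations, which is the navigation burden the abstract attributes to single-agent designs.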
