Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Abstract

Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first uses visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived visual context, it then autonomously plans and decomposes the complex operation task and navigates mobile apps through step-by-step operations. Unlike previous solutions that rely on apps' XML files or mobile system metadata, Mobile-Agent takes a vision-centric approach that adapts more readily to diverse mobile operating environments, eliminating the need for system-specific customization. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations. A comprehensive evaluation on Mobile-Eval shows that Mobile-Agent achieves remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
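The perceive, plan, and operate cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Element`, `Action`, `run_agent`, and the three callbacks are hypothetical, and the toy planner matches labels by substring where a real agent would presumably query an MLLM on a screenshot. It only shows the control flow of a vision-centric loop that never reads app XML or system metadata.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Element:
    """A UI element found on a screenshot (hypothetical structure)."""
    label: str                # recognized text or icon description
    center: Tuple[int, int]   # pixel coordinates of the element centre


@dataclass
class Action:
    """One operation on the device (hypothetical action format)."""
    kind: str                               # "tap" or "stop"
    target: Optional[Tuple[int, int]] = None


def run_agent(
    instruction: str,
    perceive: Callable[[], List[Element]],
    plan: Callable[[str, List[Element], List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Perceive the screen, plan one step, act, and repeat until 'stop'."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()  # vision only: no XML or system metadata
        action = plan(instruction, elements, history)
        history.append(action)
        if action.kind == "stop":
            break
        act(action)
    return history


# Toy stand-ins so the sketch runs without a device or a model.
screens = [
    [Element("Settings", (100, 200)), Element("Camera", (300, 200))],
    [Element("Wi-Fi", (100, 150))],
]
state = {"screen": 0}


def fake_perceive() -> List[Element]:
    return screens[state["screen"]]


def fake_plan(instruction: str, elements: List[Element],
              history: List[Action]) -> Action:
    # Assumption: substring matching replaces the MLLM planning step.
    tapped = {a.target for a in history}
    for el in elements:
        if el.label.lower() in instruction.lower() and el.center not in tapped:
            return Action("tap", el.center)
    return Action("stop")


def fake_act(action: Action) -> None:
    # Every tap advances to the next fake screen, up to the last one.
    if action.kind == "tap":
        state["screen"] = min(state["screen"] + 1, len(screens) - 1)


trace = run_agent("Open Settings and then Wi-Fi",
                  fake_perceive, fake_plan, fake_act)
print([(a.kind, a.target) for a in trace])
```

Passing the perception, planning, and actuation steps in as callbacks keeps the loop independent of any particular operating system, which mirrors the adaptability argument in the abstract: only `perceive` and `act` would need a device-specific backend.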
