
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

January 29, 2024
Authors: Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
cs.AI

Abstract

Mobile device agents based on multimodal large language models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first uses visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived visual context, it then autonomously plans and decomposes the complex operation task and navigates mobile apps step by step through operations. Unlike previous solutions that rely on apps' XML files or mobile system metadata, Mobile-Agent takes a vision-centric approach, which allows greater adaptability across diverse mobile operating environments and eliminates the need for system-specific customization. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conduct a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieves remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
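The perceive, plan, and operate cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`UIElement`, `Action`, `run_agent`, `ToyDevice`, `keyword_plan`) is hypothetical, the screens are hard-coded stand-ins for real screenshots, and a simple keyword matcher takes the place of the MLLM planner purely so that the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class UIElement:
    """A text or icon element located on the screen (hypothetical structure)."""
    label: str
    center: Tuple[int, int]  # pixel coordinates (x, y)


@dataclass
class Action:
    kind: str  # "tap" or "stop"
    target: Optional[Tuple[int, int]] = None


def run_agent(
    instruction: str,
    perceive: Callable[[], List[UIElement]],
    plan: Callable[[str, List[UIElement], List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Perceive the screen, plan one operation, execute it, and repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()                         # visual perception step
        action = plan(instruction, elements, history)  # planning step
        history.append(action)
        if action.kind == "stop":
            break
        execute(action)                               # operation step
    return history


# Toy stand-ins (assumptions): a real system would read screenshots and
# send the chosen operations to an actual device.
SCREENS = {
    "home": [UIElement("Settings", (100, 200)), UIElement("Camera", (300, 200))],
    "settings": [UIElement("Wi-Fi", (100, 150))],
}


class ToyDevice:
    def __init__(self) -> None:
        self.screen = "home"

    def perceive(self) -> List[UIElement]:
        return SCREENS[self.screen]

    def execute(self, action: Action) -> None:
        # Tapping "Settings" on the home screen opens the settings screen.
        if self.screen == "home" and action.target == (100, 200):
            self.screen = "settings"


def keyword_plan(
    instruction: str, elements: List[UIElement], history: List[Action]
) -> Action:
    """Placeholder planner: tap the first untapped element named in the instruction."""
    tapped = {a.target for a in history}
    for element in elements:
        if element.label.lower() in instruction.lower() and element.center not in tapped:
            return Action("tap", element.center)
    return Action("stop")


if __name__ == "__main__":
    device = ToyDevice()
    steps = run_agent(
        "Open Settings and then Wi-Fi", device.perceive, keyword_plan, device.execute
    )
    for step in steps:
        print(step)
```

Because the loop only sees located elements and their coordinates, nothing in it depends on XML layouts or system metadata, which is the property the abstract attributes to the vision-centric design.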