
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

November 13, 2023
作者: An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
cs.AI

Abstract

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as a human user would, determining subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel at zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibits a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where it outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research on the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
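To make the described workflow concrete, below is a minimal, hypothetical sketch of the kind of zero-shot single-step loop the abstract outlines: a screenshot and an instruction are sent to an LMM, which returns an action description with a screen location. This is not the paper's actual implementation; the prompt format, function names, and model identifier are illustrative assumptions, and it presumes the official `openai` Python client (v1+) with an API key in the environment.

```python
# Hypothetical sketch of a zero-shot LMM GUI-navigation step (not the paper's code):
# send the current screenshot plus the user instruction to an LMM and get back
# a proposed action with a screen coordinate.
import base64
from openai import OpenAI  # assumes openai>=1.0; reads OPENAI_API_KEY from the env

client = OpenAI()

PROMPT = (
    "You are a smartphone GUI agent. Given the screenshot and the instruction, "
    "reply with the next single-step action as ACTION(x, y) and a short reason.\n"
    "Instruction: {instruction}"
)

def next_action(screenshot_path: str, instruction: str) -> str:
    """Ask the LMM for the next single-step action on the current screen."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in identifier; the paper evaluates GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT.format(instruction=instruction)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(next_action("screen.png", "Open the Settings app"))
```

Iterating this step, with each predicted action executed and a fresh screenshot fed back in, yields the multi-step navigation behavior the abstract evaluates.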