GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

November 13, 2023
Authors: An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
cs.AI

Abstract

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibits a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where it outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay robust groundwork for future research on the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
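To make the setup concrete, below is a minimal sketch of one zero-shot navigation step of the kind the abstract describes: a screenshot and a single-step instruction are sent to a GPT-4V-style model, which returns a natural-language action description. This is an illustrative assumption, not the authors' actual implementation; the prompt wording, the `propose_next_action` helper, and the model identifier are hypothetical, and the real system (see the project page) also handles action localization on screen.

```python
# Hypothetical sketch of a single zero-shot GUI navigation step with a
# GPT-4V-style model. Not the authors' code: the prompt, helper name, and
# model identifier are assumptions for illustration only.
import base64

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def propose_next_action(screenshot_path: str, instruction: str) -> str:
    """Ask the model to describe the next action for one GUI step."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V-capable model name
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "You are a smartphone GUI navigation agent. "
                            f"Instruction: {instruction}\n"
                            "Describe the single next action to take on "
                            "this screen (e.g., which element to tap)."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content


# Example: one single-step instruction on an iOS screenshot.
# print(propose_next_action("home_screen.png", "Open the Settings app"))
```

In a multi-step episode, this call would run in a loop: execute the proposed action, capture a new screenshot, and query the model again until the instruction is fulfilled.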