GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

Abstract

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system achieved 91% accuracy in generating reasonable action descriptions and 75% accuracy in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where it outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.

