Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Abstract

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to the LLM. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI not only outperforms most open-source UI MLLMs but also surpasses GPT-4V on all the elementary UI tasks.
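The aspect-ratio-based split mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_screen`, the choice to cut exactly at the midpoint, and the treatment of square screens as portrait are all assumptions made here for illustration, since the abstract states only that a portrait screen is divided horizontally and a landscape screen vertically into 2 sub-images.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return crop boxes for the two sub-images of a screen.

    Hypothetical helper: the midpoint cut and the square-as-portrait
    rule are assumptions, not details taken from the paper.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait: horizontal division into a top and a bottom half.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: vertical division into a left and a right half.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    print(split_screen(1170, 2532))  # portrait example
    print(split_screen(2532, 1170))  # landscape example
```

The boxes use the (left, upper, right, lower) ordering accepted by Pillow's `Image.crop`, so each one could be passed to it directly to obtain a sub-image that is then encoded separately.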
