
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

April 8, 2024
作者: Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
cs.AI

Abstract

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
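The "any resolution" pre-processing described in the abstract — splitting each screen into two sub-images along the longer axis before encoding — can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the function name, the `(left, top, right, bottom)` crop-box convention, and the exact midpoint split are all assumptions for illustration.

```python
def split_screen(width: int, height: int):
    """Return two crop boxes for a UI screenshot, following the scheme
    in the abstract: portrait screens (taller than wide) are divided
    horizontally into top/bottom halves; landscape screens are divided
    vertically into left/right halves. Boxes are (left, top, right,
    bottom) tuples; an even midpoint split is an assumption here.
    """
    if height >= width:
        # Portrait: horizontal cut -> top and bottom sub-images
        mid = height // 2
        return (0, 0, width, mid), (0, mid, width, height)
    else:
        # Landscape: vertical cut -> left and right sub-images
        mid = width // 2
        return (0, 0, mid, height), (mid, 0, width, height)

# Portrait phone screenshot -> top and bottom halves
top_box, bottom_box = split_screen(1170, 2532)
# Landscape screenshot -> left and right halves
left_box, right_box = split_screen(2532, 1170)
```

Each crop box would then be applied to the screenshot (e.g., with an image library's crop call), and the two sub-images encoded separately before being passed to the LLM, as the abstract describes.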
