Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
April 8, 2024
Authors: Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Recent advancements in multimodal large language models (MLLMs) have been
noteworthy; yet these general-domain MLLMs often fall short in their ability
to comprehend and interact effectively with user interface (UI) screens. In
this paper, we present Ferret-UI, a new MLLM tailored for enhanced
understanding of mobile UI screens, equipped with referring, grounding, and
reasoning capabilities. Given that UI screens typically exhibit a more
elongated aspect ratio and contain smaller objects of interest (e.g., icons,
texts) than natural images, we incorporate "any resolution" on top of Ferret to
magnify details and leverage enhanced visual features. Specifically, each
screen is divided into 2 sub-images based on the original aspect ratio (i.e.,
horizontal division for portrait screens and vertical division for landscape
screens). Both sub-images are encoded separately before being sent to LLMs. We
meticulously gather training samples from an extensive range of elementary UI
tasks, such as icon recognition, text finding, and widget listing. These samples
are formatted for instruction-following with region annotations to facilitate
precise referring and grounding. To augment the model's reasoning ability, we
further compile a dataset for advanced tasks, including detailed description,
perception/interaction conversations, and function inference. After training on
the curated datasets, Ferret-UI exhibits outstanding comprehension of UI
screens and the capability to execute open-ended instructions. For model
evaluation, we establish a comprehensive benchmark encompassing all the
aforementioned tasks. Ferret-UI excels not only beyond most open-source UI
MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
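The "any resolution" split described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, only an assumed reading of the described behavior: a screen is cut into 2 sub-images along its longer axis, so portrait screens get a horizontal division (top/bottom halves) and landscape screens a vertical one (left/right halves). The function name and crop-box convention are hypothetical.

```python
def split_screen(width, height):
    """Divide a screen into 2 sub-images based on its aspect ratio,
    as described in the abstract: a horizontal division for portrait
    screens, a vertical division for landscape screens.

    Returns two (left, top, right, bottom) crop boxes (a common image
    crop convention; the paper's exact scheme may differ)."""
    if height >= width:
        # Portrait: horizontal cut, yielding top and bottom halves.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    else:
        # Landscape: vertical cut, yielding left and right halves.
        mid = width // 2
        return [(0, 0, mid, height), (mid, 0, width, height)]

# A portrait phone screen yields two stacked sub-images:
print(split_screen(1170, 2532))
# A landscape screen yields two side-by-side sub-images:
print(split_screen(2532, 1170))
```

Each crop box would then be encoded separately (alongside the full image in Ferret-style models) before the visual features are passed to the LLM.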