GLaMM: Pixel Grounding Large Multimodal Model
November 6, 2023
Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
cs.AI
Abstract
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial efforts towards LMMs used holistic images and text prompts to
generate ungrounded textual responses. Very recently, region-level LMMs have
been used to generate visually grounded responses. However, they are limited to
referring to only a single object category at a time, require users to specify the
regions in inputs, or cannot offer dense pixel-wise object grounding. In this
work, we present Grounding LMM (GLaMM), the first model that can generate
natural language responses seamlessly intertwined with corresponding object
segmentation masks. GLaMM not only grounds objects appearing in the
conversations but is flexible enough to accept both textual and optional visual
prompts (region of interest) as input. This empowers users to interact with the
model at various levels of granularity, both in textual and visual domains. Due
to the lack of standard benchmarks for the novel setting of generating visually
grounded detailed conversations, we introduce a comprehensive evaluation
protocol with our curated grounded conversations. Our proposed Grounded
Conversation Generation (GCG) task requires densely grounded concepts in
natural scenes at a large scale. To this end, we propose a densely annotated
Grounding-anything Dataset (GranD) using our proposed automated annotation
pipeline that encompasses 7.5M unique concepts grounded in a total of 810M
regions available with segmentation masks. Besides GCG, GLaMM also performs
effectively on several downstream tasks, e.g., referring expression
segmentation, image- and region-level captioning, and vision-language
conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.
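
The abstract describes responses in which natural language is interleaved with per-object segmentation masks. As a rough illustration only (the names `GroundedSpan`, `GroundedResponse`, and the mask layout below are assumptions for this sketch, not the paper's API or output format), here is one way such an interleaved output could be represented:

```python
# Hedged sketch: a hypothetical container for a grounded conversation response,
# where some phrases in the generated text carry a pixel-level binary mask.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class GroundedSpan:
    """A span of generated text, optionally grounded by a binary mask."""
    text: str
    mask: Optional[np.ndarray] = None  # HxW boolean mask; None if ungrounded


@dataclass
class GroundedResponse:
    """A response interleaving free-form text with grounded object spans."""
    spans: List[GroundedSpan]

    def plain_text(self) -> str:
        # Concatenate all spans to recover the ordinary textual answer.
        return "".join(s.text for s in self.spans)

    def grounded_phrases(self) -> List[str]:
        # Phrases that come with a pixel-level segmentation mask.
        return [s.text for s in self.spans if s.mask is not None]


if __name__ == "__main__":
    h, w = 4, 6  # tiny dummy image size, for illustration only
    response = GroundedResponse(spans=[
        GroundedSpan("A "),
        GroundedSpan("dog", mask=np.zeros((h, w), dtype=bool)),
        GroundedSpan(" is lying on a "),
        GroundedSpan("sofa", mask=np.ones((h, w), dtype=bool)),
        GroundedSpan("."),
    ])
    print(response.plain_text())        # "A dog is lying on a sofa."
    print(response.grounded_phrases())  # ["dog", "sofa"]
```

In this toy representation, the ungrounded spans carry only text, while grounded spans attach a mask, mirroring the idea of text "seamlessly intertwined with corresponding object segmentation masks" described above.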