

GLaMM: Pixel Grounding Large Multimodal Model

November 6, 2023
Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
cs.AI

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring a single object category at a time, require users to specify the regions in inputs, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large-scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks e.g., referring expression segmentation, image and region-level captioning and vision-language conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.
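To make the Grounded Conversation Generation (GCG) setting more concrete, the sketch below shows one plausible way to represent a response whose phrases are tied to pixel-level segmentation masks. It is purely illustrative: the `GroundedPhrase` and `GCGResponse` classes, the toy masks, and the example caption are hypothetical and are not part of the GLaMM codebase or its API.

```python
# Minimal, hypothetical sketch of a GCG-style output: a free-form caption whose
# noun phrases are each tied to a pixel-level segmentation mask. The class and
# field names here are illustrative only and are NOT the GLaMM API.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class GroundedPhrase:
    text: str              # phrase as it appears in the caption, e.g. "a dog"
    span: Tuple[int, int]  # (start, end) character offsets into the caption
    mask: np.ndarray       # boolean (H, W) segmentation mask for that phrase


@dataclass
class GCGResponse:
    caption: str                   # the full natural-language response
    phrases: List[GroundedPhrase]  # densely grounded concepts in the caption

    def coverage(self) -> float:
        """Fraction of image pixels covered by at least one grounded mask."""
        if not self.phrases:
            return 0.0
        union = np.zeros_like(self.phrases[0].mask, dtype=bool)
        for p in self.phrases:
            union |= p.mask
        return float(union.mean())


if __name__ == "__main__":
    h, w = 4, 6  # toy image size
    dog_mask = np.zeros((h, w), dtype=bool)
    dog_mask[1:3, 1:4] = True
    frisbee_mask = np.zeros((h, w), dtype=bool)
    frisbee_mask[0, 4:6] = True

    caption = "A dog leaps to catch a frisbee in the park."
    response = GCGResponse(
        caption=caption,
        phrases=[
            GroundedPhrase("A dog", (0, 5), dog_mask),
            GroundedPhrase("a frisbee", (21, 30), frisbee_mask),
        ],
    )
    for p in response.phrases:
        # Each grounded phrase must match its span in the caption exactly.
        assert caption[p.span[0]:p.span[1]] == p.text
        print(f"{p.text!r} -> mask with {int(p.mask.sum())} pixels")
    print(f"grounded pixel coverage: {response.coverage():.2f}")
```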