GLaMM: Pixel Grounding Large Multimodal Model

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, these models can refer to only a single object category at a time, require users to specify the regions in the input, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is also flexible enough to accept both textual and optional visual prompts (regions of interest) as input. This lets users interact with the model at various levels of granularity, in both the textual and visual domains. Because standard benchmarks are lacking for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD), built with our proposed automated annotation pipeline, that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.
