Contrastive Localized Language-Image Pre-Training

Abstract

Contrastive Language-Image Pre-training (CLIP) is a celebrated method for training vision encoders to generate image and text representations that facilitate a wide range of applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs), connecting image inputs to language interactions. The success of CLIP as a vision-language foundation model relies on aligning noisy, web-crawled text annotations at the image level. Nevertheless, this criterion may be insufficient for downstream tasks that need fine-grained vision representations, especially when MLLMs require region-level understanding. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually enriched and spatially localized captioning framework that effectively generates region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
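
To make the idea of a region-text contrastive loss over promptable embeddings more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the spatial hint is a box in patch-grid coordinates, that mean pooling stands in for the learned module that turns image embeddings into a region representation, and that the loss is a symmetric InfoNCE objective in the style of CLIP. The function names `pool_region` and `region_text_contrastive_loss` and the temperature value are hypothetical choices made for illustration.

```python
import math


def pool_region(patch_grid, box):
    """Average the patch embeddings whose grid cells fall inside `box`.

    patch_grid: H x W nested list of D-dimensional embeddings.
    box: (row0, col0, row1, col1), half-open, in grid coordinates.
    Assumption: mean pooling is a stand-in for the learned module.
    """
    r0, c0, r1, c1 = box
    cells = [patch_grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    if not cells:
        raise ValueError("box does not cover any patch")
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; region i is paired with text i.

    Assumption: the temperature is a fixed illustrative value.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed)
    )


# Toy usage: a 2 x 2 patch grid with 2-D embeddings and two region captions.
grid = [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]]
top_row = pool_region(grid, (0, 0, 1, 2))
bottom_row = pool_region(grid, (1, 0, 2, 2))
captions = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical text embeddings
loss = region_text_contrastive_loss([top_row, bottom_row], captions)
```

In this toy setup the loss is small when each pooled region lines up with its own caption and grows when the captions are swapped, which is the behaviour a region-text contrastive objective is meant to reward.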
