Improving fine-grained understanding in image-text pre-training

Abstract

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to a single word, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute, for each token, a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that depends only on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g., classification, and on region-level tasks relying on fine-grained information, e.g., retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
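The grouping and the per-sample loss described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The abstract does not say how the similarity is made sparse, so the min-max normalisation, the 1/P threshold (P being the number of patches), the cosine-similarity logits, the temperature value, and the function names `language_grouped_embeddings` and `fine_grained_loss` are all assumptions of this sketch. The global image-text contrastive loss that SPARC adds on top is not shown.

```python
import numpy as np


def language_grouped_embeddings(patches, tokens, threshold=None):
    """Group patches per token via a sparsified similarity (illustrative).

    patches: (P, D) patch embeddings; tokens: (T, D) token embeddings.
    Returns (weights, grouped): weights is (T, P), grouped is (T, D).
    """
    num_patches = patches.shape[0]
    if threshold is None:
        threshold = 1.0 / num_patches  # assumed sparsity threshold
    sim = tokens @ patches.T  # (T, P) token-to-patch similarity
    # Min-max normalise each token's similarities to [0, 1] (assumption).
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / np.maximum(hi - lo, 1e-8)
    # Sparsify: zero out patches whose normalised similarity is below the threshold.
    sim = np.where(sim < threshold, 0.0, sim)
    # Renormalise the surviving weights (they sum to 1 per token unless all are zero).
    weights = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-8)
    grouped = weights @ patches  # weighted average of patches per token
    return weights, grouped


def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def fine_grained_loss(patches, tokens, temperature=0.07):
    """Sequence-wise contrastive loss for one sample (no batch negatives)."""
    _, grouped = language_grouped_embeddings(patches, tokens)
    # L2-normalise so dot products are cosine similarities.
    g = grouped / np.maximum(np.linalg.norm(grouped, axis=1, keepdims=True), 1e-8)
    t = tokens / np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-8)
    logits = (g @ t.T) / temperature  # (T, T): grouped embeddings vs. tokens
    diag = np.arange(tokens.shape[0])
    # Positives sit on the diagonal; the other tokens of the same caption act as negatives.
    loss_g2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2g = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_g2t + loss_t2g)


# Toy usage with random embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))  # 16 patches, 8-dim embeddings
tokens = rng.normal(size=(5, 8))    # 5 caption tokens
weights, grouped = language_grouped_embeddings(patches, tokens)
loss = fine_grained_loss(patches, tokens)
```

Because the negatives come from the other tokens of the same caption, the logits matrix is only T x T per sample, which is one way to read the abstract's claim that the fine-grained loss is computationally inexpensive.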
