SITTA: A Semantic Image-Text Alignment for Image Captioning

Abstract

Textual and semantic comprehension of images is essential for generating proper captions. This comprehension requires detecting objects, modeling the relations between them, assessing the semantics of the scene and, finally, representing the extracted knowledge in a language space. To achieve rich language capabilities while ensuring good image-language mappings, pretrained language models (LMs) have been conditioned on pretrained multi-modal (image-text) models that allow for image inputs. This requires aligning the image representation of the multi-modal model with the language representations of a generative LM. However, it is not clear how best to transfer semantics detected by the vision encoder of the multi-modal model to the LM. We introduce two novel ways of constructing a linear mapping that successfully transfers semantics between the embedding spaces of the two pretrained models. The first aligns the embedding space of the multi-modal language encoder with the embedding space of the pretrained LM via token correspondences. The second leverages additional data consisting of image-text pairs to construct the mapping directly from vision to language space. Using our semantic mappings, we unlock image captioning for LMs without access to gradient information. By using different sources of data, we achieve strong captioning performance on the MS-COCO and Flickr30k datasets. Even in the face of limited data, our method partly exceeds the performance of other zero-shot and even finetuned competitors. Our ablation studies show that even LMs at a scale of merely 250M parameters can generate decent captions using our semantic mappings. Our approach makes image captioning more accessible for institutions with restricted computational resources.
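
To make the token-correspondence idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state how the linear mapping is fit; ordinary least squares is assumed here. The embeddings are synthetic stand-ins for the multi-modal text encoder and the LM, and the names fit_linear_map and nearest_tokens are hypothetical. Because a multi-modal model places images and text in a shared space, a map fit on text-token pairs can then be applied to an image embedding.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Return W minimizing ||src @ W - tgt||^2 (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def nearest_tokens(query, lm_embeddings, k=3):
    """Indices of the k LM token embeddings most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    E = lm_embeddings / np.linalg.norm(lm_embeddings, axis=1, keepdims=True)
    return np.argsort(-(E @ q))[:k]


rng = np.random.default_rng(0)
n_tokens, d_mm, d_lm = 200, 16, 32  # assumed toy sizes

# Synthetic token correspondences: row i of each matrix embeds the same token.
mm_text_emb = rng.normal(size=(n_tokens, d_mm))  # multi-modal text encoder
true_map = rng.normal(size=(d_mm, d_lm))         # hidden relation (toy data only)
lm_emb = mm_text_emb @ true_map                  # LM embedding space

# Fit the mapping from paired embeddings alone; no gradients are needed.
W = fit_linear_map(mm_text_emb, lm_emb)

# Stand-in for an image embedding that lies near token 7 in the shared space.
image_emb = mm_text_emb[7] + 0.01 * rng.normal(size=d_mm)

# Project the image into LM space and look up the closest LM tokens.
top = nearest_tokens(image_emb @ W, lm_emb, k=3)
print(top)
```

In a captioning setup, the retrieved tokens could be placed in the LM's prompt to condition generation on the image content. The second construction described in the abstract would fit the same kind of map on image-caption pairs instead of token pairs.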
