Learning the Visualness of Text Using Large Vision-Language Models

Abstract

Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method to automatically detect visualness in text will unlock the ability to augment text with relevant images, as neural text-to-image generation and retrieval models operate on the implicit assumption that the input text is visual in nature. We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators. Additionally, we use documents that contain text and visual assets to create a distantly supervised corpus of document text and associated images. We also propose a fine-tuning strategy that adapts large vision-language models like CLIP, which assume a one-to-one correspondence between text and image, to the task of scoring text visualness from text input alone. Our strategy involves modifying the model's contrastive learning objective to map text identified as non-visual to a common NULL image while matching visual text to its corresponding images in the document. We evaluate the proposed approach on its ability to (i) classify visual and non-visual text accurately, and (ii) attend to words that are identified as visual in psycholinguistic studies. Empirical evaluation indicates that our approach performs better than several heuristics and baseline models for the proposed task. Furthermore, to highlight the importance of modeling the visualness of text, we conduct qualitative analyses of text-to-image generation systems like DALL-E.
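
The modified contrastive objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pure-Python vector math, the text-to-image-only direction of the loss, and the temperature of 0.07 are all assumptions made for illustration. Each text is scored against the images in the batch plus one shared NULL image; a visual text is trained toward its own image, and a non-visual text toward the NULL image.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def null_image_contrastive_loss(pairs, null_emb, temperature=0.07):
    """Hypothetical text-to-image loss with a shared NULL image.

    pairs: list of (text_embedding, image_embedding or None).
           None marks a text treated as non-visual.
    null_emb: embedding of the shared NULL image (assumed given).
    """
    candidates = []  # image embeddings that texts are scored against
    targets = []     # index of the correct candidate for each text
    for _, image in pairs:
        if image is None:
            targets.append(None)  # filled in once the NULL index is known
        else:
            targets.append(len(candidates))
            candidates.append(normalize(image))
    candidates.append(normalize(null_emb))
    null_index = len(candidates) - 1

    total = 0.0
    for (text, _), target in zip(pairs, targets):
        t = normalize(text)
        logits = [dot(t, c) / temperature for c in candidates]
        if target is None:
            target = null_index  # non-visual text maps to the NULL image
        # Numerically stable cross-entropy for the target candidate.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(pairs)


# Toy usage: one visual text aligned with its image, one non-visual text
# aligned with the NULL image.
aligned = [([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]), ([0.0, 1.0, 0.0], None)]
null_image = [0.0, 1.0, 0.0]
loss_aligned = null_image_contrastive_loss(aligned, null_image)
```

In this sketch the NULL image is just one more candidate in the softmax, so the loss is low only when non-visual text embeddings sit close to the NULL embedding and visual text embeddings sit close to their paired images. How a visualness score is then read off the fine-tuned model is not specified in the abstract and is left out here.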
