Learning the Visualness of Text Using Large Vision-Language Models
May 11, 2023
Authors: Gaurav Verma, Ryan A. Rossi, Christopher Tensmeyer, Jiuxiang Gu, Ani Nenkova
cs.AI
Abstract
Visual text evokes an image in a person's mind, while non-visual text fails
to do so. A method to automatically detect visualness in text will unlock the
ability to augment text with relevant images, as neural text-to-image
generation and retrieval models operate on the implicit assumption that the
input text is visual in nature. We curate a dataset of 3,620 English sentences
and their visualness scores provided by multiple human annotators.
Additionally, we use documents that contain text and visual assets to create a
distantly supervised corpus of document text and associated images. We also
propose a fine-tuning strategy that adapts large vision-language models such as
CLIP, which assume a one-to-one correspondence between text and image, to the
task of scoring text visualness from text input alone. Our strategy involves
modifying the model's contrastive learning objective to map text identified as
non-visual to a common NULL image while matching visual text to its
corresponding image in the document. We evaluate the proposed approach on its
ability to (i) classify visual and non-visual text accurately, and (ii) attend
over words that are identified as visual in psycholinguistic studies. Empirical
evaluation indicates that our approach performs better than several heuristics
and baseline models for the proposed task. Furthermore, to highlight the
importance of modeling the visualness of text, we conduct qualitative analyses
of text-to-image generation systems like DALL-E.
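
To make the modified objective concrete, here is a minimal PyTorch sketch of the idea described in the abstract, not the authors' released implementation: non-visual texts in a batch are contrasted against a single shared, learned NULL image embedding, while visual texts are contrasted against their paired document images, under CLIP's usual symmetric InfoNCE loss. Names such as `null_embed` and `is_visual` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def visualness_contrastive_loss(text_emb, image_emb, is_visual, null_embed,
                                temperature=0.07):
    """Contrastive loss with a shared NULL image target for non-visual text.

    text_emb:   (B, D) L2-normalized CLIP text embeddings
    image_emb:  (B, D) L2-normalized CLIP image embeddings of paired images
    is_visual:  (B,) boolean mask; False marks text labeled non-visual
    null_embed: (D,) learned embedding standing in for the NULL image
    """
    null = F.normalize(null_embed, dim=-1)
    # Swap the image targets of non-visual texts for the shared NULL embedding.
    targets = torch.where(is_visual.unsqueeze(-1), image_emb,
                          null.expand_as(image_emb))
    logits = text_emb @ targets.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric InfoNCE, as in CLIP's original objective.
    # Caveat: if a batch holds several non-visual texts, their identical NULL
    # columns act as extra negatives; the paper's exact handling may differ.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Under the same assumption, a text-only visualness score at inference could plausibly be read off as the inverse of a text embedding's similarity to the NULL embedding, though the paper's exact scoring procedure may differ.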