Learning the Visualness of Text Using Large Vision-Language Models
May 11, 2023
Authors: Gaurav Verma, Ryan A. Rossi, Christopher Tensmeyer, Jiuxiang Gu, Ani Nenkova
cs.AI
Abstract
Visual text evokes an image in a person's mind, while non-visual text fails
to do so. A method to automatically detect visualness in text will unlock the
ability to augment text with relevant images, as neural text-to-image
generation and retrieval models operate on the implicit assumption that the
input text is visual in nature. We curate a dataset of 3,620 English sentences
and their visualness scores provided by multiple human annotators.
Additionally, we use documents that contain text and visual assets to create a
distantly supervised corpus of document text and associated images. We also
propose a fine-tuning strategy that adapts large vision-language models like
CLIP, which assume a one-to-one correspondence between text and image, to the
task of scoring text visualness from text input alone. Our strategy involves
modifying the model's contrastive learning objective to map text identified as
non-visual to a common NULL image while matching visual text to its
corresponding images in the document. We evaluate the proposed approach on its
ability to (i) classify visual and non-visual text accurately, and (ii) attend
over words that are identified as visual in psycholinguistic studies. Empirical
evaluation indicates that our approach performs better than several heuristics
and baseline models for the proposed task. Furthermore, to highlight the
importance of modeling the visualness of text, we conduct qualitative analyses
of text-to-image generation systems like DALL-E.
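
The modified contrastive objective described above can be illustrated with a
minimal sketch. This is not the authors' implementation: the abstract gives no
formula, so the text-to-image cross-entropy form, the temperature value, the
function names (`null_image_contrastive_loss`, `visualness_score`), and the
scoring rule based on similarity to the NULL embedding are all assumptions made
for illustration. Embeddings are plain Python lists standing in for encoder
outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def null_image_contrastive_loss(text_embs, image_embs, is_visual, null_emb,
                                temperature=0.07):
    """Mean text-to-image cross-entropy with a shared NULL image target.

    Assumed setup (hypothetical, for illustration only):
    text_embs[i] and image_embs[i] come from the same document;
    is_visual[i] says whether text i was identified as visual;
    null_emb is the embedding of the common NULL image.
    """
    # Candidate set: every image in the batch plus the NULL image (last slot).
    candidates = [normalize(v) for v in image_embs] + [normalize(null_emb)]
    null_index = len(candidates) - 1

    total = 0.0
    for i, text in enumerate(text_embs):
        t = normalize(text)
        logits = [dot(t, c) / temperature for c in candidates]
        # Visual text targets its own image; non-visual text targets NULL.
        target = i if is_visual[i] else null_index
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[target]
    return total / len(text_embs)


def visualness_score(text_emb, null_emb):
    """Hypothetical score in [0, 1]: low when the text sits near NULL."""
    cos = dot(normalize(text_emb), normalize(null_emb))
    return (1.0 - cos) / 2.0
```

Under these assumptions, the loss is small when visual text lies close to its
own image and non-visual text lies close to the NULL image, and it grows when
those assignments are swapped. Scoring then needs only a text embedding and the
fixed NULL embedding, which matches the stated goal of scoring from text input
alone.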