Parrot Captions Teach CLIP to Spot Text
December 21, 2023
Authors: Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
cs.AI
Abstract
Despite CLIP being the foundation model in numerous vision-language
applications, CLIP suffers from a severe text spotting bias. Such bias
causes CLIP models to 'Parrot' the visual text embedded within images while
disregarding the authentic visual semantics. We uncover that in the most
popular image-text dataset LAION-2B, the captions also densely parrot (spell)
the text embedded in images. Our analysis shows that around 50% of
images are embedded with visual text content, and around 90% of their
captions more or less parrot the visual text. Based on such observation, we
thoroughly inspect the different released versions of CLIP models and verify
that the visual text is the dominant factor in measuring the LAION-style
image-text similarity for these models. To examine whether these parrot
captions shape the text spotting bias, we train a series of CLIP models with
LAION subsets curated by different parrot-caption-oriented criteria. We show
that training with parrot captions easily shapes such bias but harms the
expected vision-language representation learning in CLIP models. This suggests
that it is urgent to revisit either the design of CLIP-like models or the
existing image-text dataset curation pipeline built on CLIP score filtering.
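
To make the two measurements above concrete, the following is a minimal Python sketch of (1) LAION-style CLIP image-text similarity and (2) a simple check of how much a caption parrots the text spotted in an image. It assumes the open_clip_torch and easyocr packages; the checkpoint name, OCR engine, and word-overlap metric are illustrative stand-ins rather than the paper's exact pipeline.

# A rough sketch of the two measurements described in the abstract (illustrative only):
# (1) LAION-style CLIP image-text similarity, and
# (2) the fraction of OCR-spotted words that a caption "parrots".
# Assumes open_clip_torch and easyocr; the checkpoint, OCR engine, and overlap
# metric are stand-ins, not necessarily the paper's exact setup.
import torch
import open_clip
import easyocr
from PIL import Image

# Open-source CLIP checkpoint trained on LAION-2B.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
ocr_reader = easyocr.Reader(["en"], gpu=False)

def clip_score(image_path: str, caption: str) -> float:
    # Cosine similarity between the CLIP image and text embeddings.
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def parrot_ratio(image_path: str, caption: str) -> float:
    # Fraction of OCR-detected words that also appear in the caption:
    # 1.0 means every spotted word is parroted, 0.0 means none are.
    ocr_words = set()
    for _, text, _ in ocr_reader.readtext(image_path):
        ocr_words.update(text.lower().split())
    if not ocr_words:
        return 0.0
    caption_words = set(caption.lower().split())
    return len(ocr_words & caption_words) / len(ocr_words)

# A pair with both a high CLIP score and a high parrot ratio suggests the
# similarity is driven by spelled-out visual text rather than visual semantics:
# clip_score("poster.jpg", "big summer sale 50% off")
# parrot_ratio("poster.jpg", "big summer sale 50% off")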